The variation in fault density on Air Force programs is enormous: the worst
programs are 390 times more error-prone than the best. Obviously, there are
some critical differences in these programs that cause more errors to be
introduced or left undetected. If we could solve the problem of what these
differences are and how to control them, then we would have learned
something fundamental about the occurrence of errors in software and how to
avoid them.

To  increase  our  understanding  of  what  happens  during  a  software  project,  this 
effort  sought  to  discover  empirical  evidence  of  development  process  and 
software  product  variables  that  affect  error  occurrence.  The  starting  point  was 
a  set  of  variables  characterizing  software  quality  that  were  developed  in 
previous  RADC  work.  RADC  used  three  methods  to  gather  data:  reviewing 
published  reports,  examining  software  error  data  bases  from  the  NASA 
Software  Engineering  Laboratory  and  the  RADC  Data  and  Analysis  Center  for 
Software,  and  collecting  information  directly  from  three  software  projects. 
RADC  analyzed  59  projects,  totaling  over  5  million  lines  of  code,  to  refine  the 
initial  set  of  variables  and  obtained  sufficient  evidence  to  recommend  8 
variables  for  use  in  controlling  software  errors. 

Using  these  variables,  RADC  developed  prediction  and  estimation  models  to 
express  software  reliability  in  terms  of  fault  density  (the  number  of  faults  per 
executable  lines  of  code)  and  failure  rate  (the  number  of  failures  during  the 
execution  time  of  a  program).  Through  the  prediction  and  estimation 
techniques,  project  personnel  can  see  what  variables  affect  fault  density  and 
failure  rate  and  can  determine  what  variables  can  be  controlled  in  their 
projects  to  meet  requirements.  During  an  experimental  application  of  the 
predictive  and  estimation  techniques,  there  was  less  than  a  20%  error 
between  the  values  predicted  by  the  techniques  and  what  actually  occurred 
on  a  small  Production  Center-type  application.  Although  the  techniques  are 
by  no  means  validated,  this  result  is  encouraging. 
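
The two reliability measures defined above, and the comparison of a predicted value with an observed one, can be sketched as follows. This is only an illustration: the function names and all project figures are hypothetical, and the report's actual prediction procedures are those of the companion guidebook.

```python
def fault_density(faults: int, executable_lines: int) -> float:
    """Faults per executable line of code."""
    if executable_lines <= 0:
        raise ValueError("executable_lines must be positive")
    return faults / executable_lines


def failure_rate(failures: int, execution_hours: float) -> float:
    """Failures per hour of program execution time."""
    if execution_hours <= 0:
        raise ValueError("execution_hours must be positive")
    return failures / execution_hours


def relative_error(predicted: float, actual: float) -> float:
    """Absolute prediction error, relative to the observed value."""
    return abs(predicted - actual) / abs(actual)


# Hypothetical project figures, for illustration only.
predicted_density = fault_density(faults=45, executable_lines=10_000)
actual_density = fault_density(faults=50, executable_lines=10_000)
error = relative_error(predicted_density, actual_density)

print(f"observed fault density: {actual_density:.4f} faults per line")
print(f"observed failure rate: {failure_rate(4, 200.0):.3f} failures per hour")
print(f"prediction error: {error:.0%}")
```

A prediction error computed this way is the kind of figure that the "less than 20%" statement above refers to.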


In addition to the predictive techniques, RADC developed checklists that could
be applied throughout the life cycle to help improve the quality of the software.
The checklists are a series of questions to be answered at key milestone
reviews. Detailed procedures were also produced to show how to measure
the variables and apply the checklists and are available in the guidebook
companion to this volume.

1.0 INTRODUCTION


1.1 PURPOSE

The  purpose  of  this  report  is  to  describe  the  results  of  a 
research  and  development  effort  to  develop  a  methodology  for 
predicting  and  estimating  software  reliability.  This  report 
represents  the  final  report  of  the  project.  This  effort  was 
performed  under  Contract  Number  F30602-83-C-0118  for  the  U.S.  Air 
Force Rome Air Development Center (RADC).


1.2 SCOPE

The  reliability  of  computer-based  systems  (particularly  embedded 
systems)  within  the  Department  of  Defense  (DoD)  has  been  a 
subject  of  considerable  concern  for  a  number  of  years.  For  most 
DoD  systems,  the  reliability  of  the  system  is  critical  to 
effective  mission  performance.  In  the  past,  the  approach  to 
determining  or  predicting  system  reliability  has  been  to  look  at 
the  hardware  components,  calculate  their  combined  reliability, 
assume  software  reliability  was  one,  and  use  the  hardware 
reliability  number  as  the  system  reliability. 
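
The effect of that traditional shortcut can be illustrated with a small sketch. It assumes, as a simplification not stated in this report, that the components are independent and arranged in series, so that reliabilities multiply; all numeric values are hypothetical.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of independent components in series (assumed structure)."""
    return prod(component_reliabilities)


# Hypothetical hardware component reliabilities over one mission.
hardware = [0.99, 0.98, 0.995]
hardware_reliability = series_reliability(hardware)

# Traditional approach: software reliability assumed to be one.
system_assuming_perfect_software = hardware_reliability * 1.0

# Same system with a hypothetical software reliability below one.
software_reliability = 0.95
system_with_software = hardware_reliability * software_reliability

print(f"software assumed perfect: {system_assuming_perfect_software:.4f}")
print(f"software included: {system_with_software:.4f}")
```

Under these assumptions the system figure can only drop once software is included, which is why treating software reliability as one overstates system reliability.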

Experience,  however,  has  shown  that  software  is  a  significant 
contributor  to  system  failures.  In  fact,  the  reliability  of 
hardware  components  in  Air  Force  computer  systems  has  improved  to 
a  point  where  software  reliability  is  becoming  the  major  factor 
in determining the overall system reliability. Hardware reliability
is a well-understood aspect of system engineering, with
measures  for  Mean-Time-Between-Failures  and  a  model  dealing  with 
the  aging  of  components. 

Software  reliability  is  a  more  complex  concept  than  hardware 
reliability  and  is  not  understood  nearly  as  well.  Attempts  to 
predict  software  reliability  have  met  with  limited  success. 
Without an accepted predictive software reliability figure-of-merit
and/or software reliability estimation number, it is
impossible to determine the impact of software reliability on
system  reliability.  This  effort  seeks  to  improve  reliability 
prediction  and  estimation. 

Since  1978,  RADC  has  been  pursuing  a  program  to  achieve  better 
control  of  software  quality.  The  thrust  has  been  threefold.  One 
dimension  of  the  research  centers  around  an  RADC  and  Electronic 
Systems  Division  sponsored  effort  entitled,  "Factors  in  Software 
Quality" [MCCA77], which established a three-level hierarchical
framework  of  software  quality  and  determined  that  software 
quality  can  be  measured  and  predicted  by  the  absence,  presence, 
or  degree  of  some  identifiable  software  product  attributes.  At 
the  top  level  of  the  framework,  user-oriented  factors  that 
contribute to software quality have been defined (including
reliability, correctness, testability, maintainability, flexibility,
integrity, reusability, etc.). These factors were succeeded
by  more  software-oriented  criteria  and  metrics  at  the  second  and 
third  levels,  respectively.  Additional  research  sponsored  by 
RADC and the U.S. Army Computer Systems Command has (1) enhanced
this  framework,  and  (2)  developed  an  Automated  Quality 
Measurement  System  (AMS).  This  work  is  related  to  those  efforts 
by  seeking  to  improve  and  enhance  the  measurement  of  software 
reliability.  The  results  of  the  above  efforts  have  been 
documented  in: 

•  "Software Reliability Study", RADC-TR-76-238 [THAY76],

•  "Factors in Software Quality", RADC-TR-77-369 [MCCA77],

•  "Software Quality Metrics Enhancement", RADC-TR-80-109
[MCCA80],

•  "Software Quality Measurement for Distributed Systems",
RADC-TR-175 [BOWE83], and

•  "Specification of Software Quality Attributes", 3 Volumes,
RADC-TR-85-37 [BOWE85].

The  RADC  Quality  Measurement  Framework  identifies  four  factors 
that  impact  software  and  system  reliability: 

1.  Software  Reliability  (the  extent  to  which  a  program  can  be 
expected  to  perform  its  intended  function  with  required 
precision).

2.  Software  Correctness  (the  extent  to  which  a  program 
satisfies  its  specifications  and  fulfills  the  user's 
mission  objectives). 

3.  Software  Maintainability  (the  effort  required  to  locate 
and  fix  an  error  in  an  operational  program). 

4.  Software  Testability  (the  effort  required  to  validate  the 
specified  software  operation  and  performance). 

These  factors  and  their  associated  criteria  and  metrics  attempt 
to  predict  software  performance  by  measuring  various  attributes 
from  software  code  and  documentation  such  as  the  software's 
consistency,  completeness,  simplicity,  accuracy,  error  tolerance, 
modularity,  etc.  The  measurements  can  be  taken  across  the 
software development life-cycle so that an early determination of
these  qualities  can  be  made. 

A  second  dimension  of  the  research  is  reliability  models.  RADC 
has  been  active  in  developing  and  validating  software  reliability 
estimation  models  such  as  the  Imperfect  Debugging  Model,  the 
Non-homogeneous  Poisson  Process  Model,  the  IBM  Poisson  Model  and 
the Generalized Poisson Model [GOEL83]. These models analyze
failure data from software testing in order to estimate the total
number  of  software  errors  present  and  the  rate  of  occurrence  at 
which  the  errors  are  being  exposed.  The  models  generally  define 
a  Mean-Time-Between-Failures  (MTBF)  based  on  the  failure  data 
analysis. 
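
As a point of reference for the MTBF figure these models report, the sketch below computes a plain observed MTBF from a list of times between failures. It is not an implementation of any of the models named above, which fit their own parameters to the failure data; the function names and the sample times are hypothetical.

```python
def mtbf_from_interfailure_times(times):
    """Observed mean time between failures: total time divided by failure count."""
    if not times:
        raise ValueError("at least one inter-failure time is required")
    return sum(times) / len(times)


def observed_failure_rate(times):
    """Reciprocal of the observed MTBF."""
    return 1.0 / mtbf_from_interfailure_times(times)


# Hypothetical execution hours between successive failures during test.
interfailure_hours = [10.0, 20.0, 30.0, 60.0]

print(f"observed MTBF: {mtbf_from_interfailure_times(interfailure_hours):.1f} hours")
print(f"observed failure rate: {observed_failure_rate(interfailure_hours):.4f} per hour")
```

Growing gaps between failures, as in the sample data, are what the estimation models interpret as faults being removed during test.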

An RADC-sponsored survey lists 24 quantitative software reliability
models that have been published up to 1979 [DACS79]. Of
those, 19 were primarily useful for estimation and five (5) were
primarily  useful  for  prediction.  All  except  one  (1)  of  the 
latter  predicted  an  initial  (usually  interpreted  to  mean  at  start 
of  formal  test)  error  content,  and  by  the  relations  discussed 
below,  this  could  be  translated  into  a  failure  rate  and  thus  be 
transitioned  into  an  estimation  model. 

Practically  all  of  these  models  assume: 

•  A  fixed  initial  number  of  faults  (bugs); 

•  A failure rate or probability that is positively correlated
with the number of faults; and

•  The  number  of  faults  will  be  reduced  as  failures  are 
observed  (not  necessarily  on  a  one-to-one  basis). 

In the simplest case, the failure rate is proportional to the
number of faults, the number of faults decreases by one for every
failure that is observed, and no new faults are introduced during
the correction. The failure rate is designated by u(t) and the
number of faults by E(t). Then

     u(t) = k E(t),                                   (1)

where k is the constant of proportionality. At start of
formal test,

     u(0) = k E(0),                                   (2)

and after an arbitrary number of failures, C, have been observed
(by our assumptions exactly C faults have, therefore, been
removed), the failure rate is

     u(1) = k [E(0) - C].                             (3)

Since u(0), u(1), and C are known, k and E(0) can be computed as

     k = [u(0) - u(1)]/C                              (4)

and

     E(0) = u(0) C/[u(0) - u(1)].                     (5)

Thus, the initial fault content and the number of remaining
faults can be obtained. Also, because the failure rate corresponds
to the fault removal rate,

     u = -dE/dt,                                      (6)

which can be combined with eq.(1) to yield

     E(t) = E(0) exp(-kt).                            (7)


In  other  words,  the  fault  content  of  a  program  and  the  failure 
rate  both  approach  zero  exponentially.  The  relations  outlined 
here  can  be  used  primarily  for  reliability  estimation.  It  is 
generally agreed that at the start of formal test about one
percent of all statements contain a fault [MORA76]. This was
also observed in [FISH79]. If the length of a program (and hence
the initial fault content) is known, this can be used to predict
the initial failure rate through use of eq.(2), and the failure
rate at any other time by adding the relation in eq.(7). Estimation
can be based simply on eq.(7), which permits translating the
failure rate at one time into the failure rate at another
(future) time.
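
The relations in eqs.(1) through (7) can be exercised with a short sketch. The function names and all numeric inputs are hypothetical; the one-percent fault fraction is the rule of thumb quoted above, and the model is the simplest case only (perfect debugging, constant k).

```python
import math


def fit_simple_model(u0: float, u1: float, c: int):
    """Return (k, E0) from eqs.(4) and (5), given failure rates before and after C failures."""
    if u0 <= u1:
        raise ValueError("failure rate must decrease for this model to apply")
    k = (u0 - u1) / c
    e0 = u0 * c / (u0 - u1)
    return k, e0


def faults_remaining(e0: float, k: float, t: float) -> float:
    """Eq.(7): E(t) = E(0) exp(-kt)."""
    return e0 * math.exp(-k * t)


def failure_rate_at(e0: float, k: float, t: float) -> float:
    """Eq.(1) applied to eq.(7): u(t) = k E(t)."""
    return k * faults_remaining(e0, k, t)


def predicted_initial_faults(statements: int, fault_fraction: float = 0.01) -> float:
    """Initial fault content from program length, using the one-percent rule of thumb."""
    return statements * fault_fraction


# Hypothetical test observations: rate falls from 0.5 to 0.4 failures/hour after 10 failures.
k, e0 = fit_simple_model(u0=0.5, u1=0.4, c=10)
print(f"k = {k:.4f}, estimated initial faults = {e0:.1f}")
print(f"estimated faults remaining after the 10 removals: {e0 - 10:.1f}")
print(f"failure rate after 100 more hours: {failure_rate_at(e0, k, 100.0):.4f}")
print(f"predicted initial faults for 10,000 statements: {predicted_initial_faults(10_000):.0f}")
```

Note that the estimate of E(0) divides by the difference of two measured failure rates, so it becomes unstable when the two rates are close; this sensitivity is a property of the simple model, not of the sketch.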

Many  of  the  models  described  in  [DACS79]  allow  for  imperfect 
debugging  (not  every  failure  results  in  a  fault  removal,  and  some 
corrections  introduce  additional  faults),  and  these  lead  to  much 
more  complex  mathematical  relations  but  still  yield  an  asymptotic 
approach to zero failure rate (e.g., [SHOO77]).

Several  of  the  more  widely  used  models  also  remove  the  assumption 
of  a  constant  proportionality  between  fault  content  and  failure 
rate, thus making k a variable. In particular, it is argued that
easy-to-find faults are removed first, and that the faults that
remain must, therefore, be harder to uncover, which means that the
value of k decreases as the debugging proceeds (e.g., [GOEL78],
[LITT80]). There is some experimental evidence that specific
fault types require more runs to be uncovered than other types
[NAGE82], and that would support the hypothesis that k decreases
with time if the environment remains unchanged.

Most  of  the  models  described  in  the  literature  use  data  from 
software  projects  that  were  either  in  test  or  were  operational, 
and  the  parameters  were  fitted  to  the  data  obtained  in  those 
environments. However, when the models have been applied to data
from other environments, poor results were generally observed
([SUKE77], [CURT79], [ANG083]).

Thus,  the  objectives  of  the  project  have  not  been  attained  in 
past  efforts.  Yet,  prior  investigations  form  a  good  foundation 
from  which  to  proceed  if  the  lessons  which  they  represent  are 
thoroughly  studied  and  integrated.  The  approach  of  the  present 
project  holds  great  promise  that  significant  improvements  in 
software  reliability  methodology  can  be  obtained  because  (a)  it 
combines  prediction  and  estimation  techniques  over  the  entire 
development cycle and (b) it integrates the previously separated
efforts in reliability prediction/estimation and software quality.

A third dimension of the research, sponsored by RADC, has been in
the area of data collection. The Data and Analysis Center for
Software (DACS) is a data repository for software developments
with the intent of making that data available for research
efforts such as this [GLOS84].

Software  quality  metrics  and  software  reliability  estimation 
models  share  a  common  goal,  i.e.,  predicting  or  estimating 
software  reliability  before  the  software  system  is  placed  into 
operational  use.  Information  concerning  the  early  prediction  of 
software  reliability  can  be  used  by  software  developers  in  making 
software  engineering  decisions  in  constructing  the  software  and 
by  acquisition  managers  in  making  acquisition  and  resource 
planning  decisions.  Part  of  the  motivation  for  both  techniques 
stems from the accepted concept that correcting poor
reliability is far less expensive early in the life-cycle than
during the operational phase.

There  are  many  similarities  between  metrics  and  models;  both  are 
relatively  new,  immature  techniques  that  have  relied  heavily  on 
historical data, not only for development, but also for validation.
Despite these similarities, there are also important
differences. Historically, metrics and models have been applied at
completely different stages of the development life-cycle:
metrics are applicable as early as the requirements phase, and
the models only after testing has begun. The models depend on
test data, while the metrics currently do not use that data at
all. Models address software
reliability alone, while metrics can be used to predict other
qualities. Finally, metrics provide data at both the software
system  and  the  module  level;  models  generally  portray  a  system 
perspective.  The  results  of  this  effort  change  this  situation  by 
combining  aspects  of  metrics  and  models  across  the  life-cycle. 

To adequately address software reliability, both the software
"product" and the software development "process" must be considered.
In addition, both the "time-dependence" and the
"time-independence" aspects of reliability must also be considered.
It must also be noted that software reliability can be
realized in different forms, depending on the software life-cycle
stage. During the software development life-cycle, software
quality metrics could be used to derive a Predictive Software
Reliability Figure-of-Merit Number, a number calculated from
software characteristics or attributes which would make a
quantitative statement about future reliability. During Software
Performance Testing, System Integration and Testing, and
Operational Test and Evaluation (OT&E), a Reliability Estimation
Number calculated from test data would represent reliability
during those phases. These numbers would serve as indicators or
guides to software reliability. During Deployment (or Operation
and Maintenance (O&M)), a final reliability assessment would be
made on achieved reliability based on actual field data, not test
data. Instead of an indirect measure of reliability, a
Reliability Assessment Benchmark will involve direct observation
of  software  failures  experienced  by  the  system  in  performing  its 
mission. 

1.3 OBJECTIVES OF PROJECT

The objective of this research and development project is the
development of a system-oriented methodology that can be used
directly for reliability prediction and reliability estimation:
first  for  software,  and  later  for  the  entire  system. 

The  methodology  must  provide: 

•  Guidance for establishing goals/requirements for software
reliability at the start of a project.

•  Useful measurement of reliability during the early phases
of the life-cycle development to permit effective correction
of potential faults.

•  Guidance for how software reliability numbers could be
used for making software engineering decisions across the
software development life-cycle.

•  A system-oriented view of embedded software.

•  A transition bridge from the early life cycle phases of
requirements, design, and coding to later phases of
operational testing.

•  Metrics  that  evaluate  and  correlate  the  quality  factors  in 
the  requirements  and  design  to  the  quality  factors  in  the 
code  and  test  results. 

In  order  to  accomplish  this  goal,  it  is  critical  that  the 
technical  approach  to  developing  this  methodology  take  into 
account  certain  key  considerations.  Those  considerations  are: 

•  The underlying system reliability characterization and
prediction technique is oriented toward Software Acquisition
Managers, Air Force System Planners, and Program
Offices.

•  In order for reliability to be built into a system, the
above key people must have an early active role in
assessing the quality and complexity of system requirements
and design, and comparing the estimated or predicted
reliability with system requirements and goals.

•  The methodology is a result of synthesis and filtering of
the many current approaches to reliability prediction and
estimation into a system-oriented procedure with a common
basis of measurement. A subset of the past research which
lends itself to merging the predictive metric techniques
with the reliability estimation models is used.

•  Problems which have plagued reliability research in the
past and which should be avoided to the degree possible
are: poor definitions in terms of units of measure;
incomplete validation of models; focus on testing/debugging
data rather than system structure; inapplicability
of techniques to early life cycle phases; and
quality assurance orientation rather than prediction
orientation.

•  To reduce data collection and analysis costs, the
potential for automating the collection of the measures
and using them to produce the Predictive Software Reliability
Figure-of-Merit Number and the Reliability Estimation
Number must be considered.


1.4 APPROACH OF PROJECT

Figure  1-1  illustrates  the  tasks  performed  during  the  entire 
research  and  development  project. 

The  first  task  involved  establishing  a  framework.  Definitions  of 
the  Reliability  Figure-of-Merit  (prediction)  and  Reliability 
Estimation  Number  (estimation)  were  also  developed.  The  utility 
of  this  approach  to  Air  Force  organizations  was  considered.  An 
interim  report  documented  these  findings.  The  results  are 
described in Section 2 of this report.

The second task involved identifying current measurements that
have potential within the framework developed in task one. The
approach to using these measurements was developed during that
task. The candidate systems for data collection were also
identified, and preliminary data collection activities, including
discussions with practitioners within DoD, were initiated. A
Phase I final report was documented. The results are documented
in Section 3 of this report.

During  task  three,  new  measurements  were  considered  for  potential 
utility  within  the  framework.  The  concentration  during  this  task 
was  in  early  life-cycle  measurements  and  the  development  of 
procedures  for  calculating  the  reliability  predictors  and 
estimators.  An  interim  report  provided  the  findings  to  date. 
These  results  are  also  provided  in  Section  3. 

During  task  four,  the  methodology  was  refined  by  settling  on  the 
measurements  to  be  used,  determining  how  the  predictive  and 
estimation  numbers  will  be  reported  and  analyzed,  and  how  their 
impact  on  system  reliability  will  be  analyzed.  These  results  are 
in Sections 5 and 6.

During  task  five,  the  measurements  were  applied  to  several 
systems in order to validate their utility. The systems chosen
for data collection in the earlier tasks were used. Statistical
analyses of the data collected and the results of the application
of the prediction and estimation techniques have been performed.
A Phase II Final Report described the results of tasks three,
four, and five, which comprised Phase II of the project.
Sections 4 and 5 of this report describe the results of these
efforts.

Task  six  (Phase  III)  involved  an  experiment  to  assess  the 
developed  methodology.  The  methodology  was  applied  in  line  with 
a  software  development  and  its  results  assessed.  Section  6  of 
this  report  describes  the  findings  of  this  task.  An  assessment  of 
changes necessary to the AMS was also made during this task.
That assessment was documented in another report.


1.5 ORGANIZATION OF REPORT

This  report  is  organized  in  two  volumes.  Volume  I  contains  the 
findings  of  the  project.  Volume  II  contains  a  Methodology  for 
Predicting  and  Estimating  Software  Reliability  based  on  the 
findings. The methodology is presented in the form of a guidebook
to aid in its application.

This  section  provides  a  brief  overview  of  the  sections  within 
this  first  volume. 

Section  1  is  the  introduction  describing  the  purpose  of  this 
report,  the  objectives  of  the  research  effort,  some  background 
information,  the  organization  of  the  report,  and  an  executive 
summary. 

Section  2  describes  the  framework  established  in  which  software 
reliability  measurement  will  be  defined.  Definitions  and 
terminology  related  to  this  framework  are  in  Appendix  A. 

Section  3  describes  the  actual  measurements  identified  during  the 
project. The process we went through to identify the measurements
and filter a large initial set to a final set is described.

Section 4 describes the data collected and delivered to RADC as a
result of this effort. Further recommendations for data collection
and retention are offered.

Section 5 describes the process we went through to demonstrate
and  validate  that  these  measurements  were  effective  at  predicting 
and  estimating  reliability.  Those  measurements  that  were 
effective  have  been  retained  in  the  methodology  described  in 
Volume  II.  Those  that  were  not  have  been  either  dropped  or 
retained  for  further  investigation/modification. 

Section 6 describes the experiment and its results and identifies how
the methodology can assist users in taking corrective actions
during a software development project.

Section 7 provides conclusions and recommendations and proposes
further research efforts and data collection activities to
continue refining the Reliability Prediction and Estimation
Methodology. Suggestions for modification of the Software
Quality Measurement Framework are also proposed.


1.6 EXECUTIVE SUMMARY

The important results of this effort can be summarized into four
areas. Each area is briefly highlighted here with reference to
the sections in the report where details can be found.

1.  Software  Reliability  Measurements  Framework 

A  framework  is  established  which  spans  the  life 
cycle  of  a  software  system.  The  framework 
acknowledges  the  inputs  of  past  RADC  research  in 
metrics  and  models  as  techniques  to  aid  in  the 
prediction  and  estimation  of  reliability  during  the 
development  process.  Completing  the  framework  are 
the  specification  and  assessment  aspects  of 
reliability  measurement.  Within  the  framework,  the 
specific data needed to measure software reliability
and the utility of the measurements to help make
sound software engineering decisions are addressed.
The  framework  is  presented  in  Section  2  of  this 
report.  Future  research  and  data  collection  should 
be  focused  by  this  framework. 

2.  Software  Reliability  Data 


This research effort probably entailed the most
comprehensive data collection/compilation effort
attempted to investigate software reliability. Over
thirty-three (33) data sources representing 59
systems and over 5 million lines of code were
accessed (including the RADC Data and Analysis
Center for Software and the NASA Software
Engineering Laboratory Data Base). Given the
diversity of the data collected,
applicable observations about software reliability
could be made. This extensive data base supported
the development of the preliminary guidebook for
making reliability predictions and estimations.
Summary data and examples of detailed data collected
are presented in Section 4 of this report.

the techniques developed during this research
effort. Utilizing the data collected and the
metrics derived from analysis, procedures are
provided which allow predictions and estimations to
be made at various milestones during a software
development project.

4.  Experiment Demonstrating Prediction and Estimation
Techniques

Section 6 of this report describes the application
of the Guidebook to an actual project. Comparisons
of the predictions and estimations with actual
results are provided.




2.0 A FRAMEWORK FOR SOFTWARE RELIABILITY PREDICTION AND ESTIMATION


2.1 THE FRAMEWORK

The current technology in software reliability, as a result of
past research efforts, has been, for the most part, not accepted
by reliability practitioners. On one hand, models of software
reliability using metrics related to structural characteristics
of the software provided predictions of the number of faults
expected in a portion of the code. This had little relevance to
reliability engineers because their orientation is time (e.g.,
failure rate or MTBF). On the other hand, models of software
reliability using failure detection rates during testing provided
relevant data, but because of necessary model assumptions, the
lateness in application, and the sensitivity to the testing
approach, the models also did not meet practitioners' needs.

A  framework  developed  during  Phase  I  of  this  effort  attempts  to 
build  upon  both  approaches  and  span  the  entire  life-cycle  in 
applicability. Figure 2-1 illustrates the Reliability Measurement
Framework.

The framework illustrates the following important characteristics:

•  The framework illustrates reliability measurement as a
life cycle activity.

•  The framework includes specification of reliability goals,
prediction of reliability during the early phases of
development, estimation of reliability during the later
phases of development, and assessment of the achieved
reliability during operations and maintenance (deployment).

•  The framework combines the measurement techniques of
software quality metrics and reliability models.

•  The techniques are described in units which are consistent.

•  The measurement techniques are also described in terms
consistent with actual reliability measurement.

•  The approach taken will lend itself to combination with
traditional hardware reliability concepts so system
reliability can be addressed.

During  the  concept  development  phase,  a  technique  to  specify  the 
software  reliability  goal  of  the  system  is  needed  which  will  be 
compatible with similar hardware reliability goals. The prediction
technique (Reliability Figure-of-Merit) is based on metrics
(quantitative measures) that can be taken during early phases of
development. These metrics are predictive or indicative in
nature. They are based on structure, development techniques and
methods, and environment. The estimation technique (Reliability
Estimation Number) is based on test results. The Estimation
Number is refined as testing progresses. During operation and
maintenance, reliability assessment is conducted. This assessment
consists of observing the actual achieved reliability and
describing it quantitatively.

This last aspect of the framework is very important to the
usability of the methodology. By requiring that the techniques
relate to actual measurement, the likelihood of acceptance with
the practitioner community is much greater. The techniques
become more understandable and relate to goals that are specified.

To  make  the  approaches  compatible,  software  reliability  must  be 
expressed  in  terms  of  failure  rate.  The  time  unit  of  measure  of 
the  failure  rate  must  be  in  terms  of  execution  time  because  this 
is  conceptually  equivalent  to  hardware  operating  time.  Figure 
2-2  illustrates  this  relationship  between  hardware  and  software 
reliability.  Appendix  A  provides  definitions  and  terminology 
related  to  this  framework. 
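
The compatibility argument above can be sketched numerically. The
following is a minimal illustration, not a method given in this
report: it assumes the hardware and software fail independently,
that both failure rates are constant (an exponential model), and
that the system fails if either one fails. The function names and
the numeric values are hypothetical.

```python
import math


def system_failure_rate(hardware_rate, software_rate):
    """Failure rate of a series system.

    Assumes independent, constant failure rates expressed in the same
    time unit: failures per operating hour for hardware and failures
    per execution hour for software.
    """
    return hardware_rate + software_rate


def reliability(failure_rate, hours):
    """Probability of no failure over `hours` (exponential model assumed)."""
    return math.exp(-failure_rate * hours)


# Hypothetical values, chosen only for illustration.
hw_rate = 0.002  # hardware failures per operating hour
sw_rate = 0.003  # software failures per execution hour

total_rate = system_failure_rate(hw_rate, sw_rate)
mission_reliability = reliability(total_rate, hours=10.0)
```

The addition is only meaningful because both rates share a time
base, which is the reason the framework asks for software failure
rates in execution time.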


2.2  UTILITY OF RELIABILITY MEASUREMENT TECHNIQUES

A major goal of this study is to define reliability prediction
and estimation concepts so they are useful to Air Force users. A
first step in achieving this goal is to identify what needs these
concepts must satisfy, or what utility they can provide to Air
Force users.

The Air Force organizations to be discussed are end-users (e.g.,
SAC and TAC), System Acquisition Managers (SAMs) and System
Program Offices (SPOs) such as ESD and ASD, Air Force Plant
Representatives (AFPRO), Test and Evaluation organizations such
as AFOTEC, Life Cycle Agents such as ALCs (AFLC), research
organizations such as RADC, developers (in most cases contractors),
and Independent Verification and Validation contractors.
Figure 2-3 illustrates the relationship of these organizations on
a typical development.

The techniques these organizations will be involved in using
include specifying reliability goals, predicting reliability
during early phases of the development, estimating reliability
during the testing phases, observing actual reliability
performance (assessment) during operations and maintenance, and
assessing what improvements can be initiated to improve the
design and production process to improve software reliability.
Their use of the four techniques and their involvement in the
various phases of a development is illustrated in Figure 2-4.
The following paragraphs describe the involvement in more detail.

2.2.1  Utility During Concept Development/Acquisition
Initiation/Mission and System Requirements Definition Of A Major
Project

During the concept development of a major project that is
dependent on software for a critical part of its function, there
is frequently a general concern about the ultimate reliability
that can be attained. The end users and SAMs are involved in
this phase. Reliability may be required in connection with
safety, as in a digital fly-by-wire system for aircraft, or it
may be desired on the basis of general mission goals, as in an
area air defense system. The central question in both
circumstances is 'will the operational reliability meet the minimum
requirements for the intended application?' If this is answered
in the affirmative, the project may proceed. If it is answered
in the negative, alternative approaches will have to be
investigated. Thus at concept development, a predicted reliability
number is needed for the concept architecture proposed to compare
it with the required system reliability. Required reliability
must be specified as a goal and incorporated in system
requirements specifications and acquisition documents.

If  the  forecasted  reliability  satisfies  the  minimum  requirements 
(and  if  other  conditions  are  met),  the  project  acquisition  will 
be  initiated.  Here  the  concern  shifts  to  establishing  milestones 
at  which  it  can  be  determined  whether  adequate  progress  is  being 
made toward meeting the reliability goals. Thus, there is at
least an implicit requirement for a model of the process by which
reliability  is  being  attained,  such  as  the  elimination  of  faults 
in  the  design  and  code.  Three  related  questions  sum  up  the 
primary  objectives  for  this  phase: 

•  "What milestones can be established to verify the attainment
of reliability goals during the course of the
development?",

•  "What are the key measures that can be obtained at each
one of the milestones?", and

•  "What techniques should be required of the developer to
promote reliable software development?".

These questions demand a detailed understanding of the software
failure process. The answers to these questions result in a
software reliability test plan, at least to the level where tests
are identified by name, scope of the system under test, and test
objectives. The System Program Office (SPO), the developer, and
the Test Agent are involved in this process of identifying
definitive reliability goals and test plans.

FIGURE 2-4. AIR FORCE ORGANIZATIONAL INVOLVEMENT IN
RELIABILITY MEASUREMENT


2.2.2  Utility During Early Software Development Phases of
Requirements Analysis, Preliminary Design, Detailed Design,
and Coding

During the phases of system development, the SAM/SPO management
is concerned with trade-offs of broad scope, e.g., allocation of
functions to hardware, software, and personnel. The principal
reliability concern in these activities is the effect of the
decisions on the global reliability of the system, and a single
measure of forecasted software reliability in the operational
environment is usually sufficient. These objectives are similar
to those described under the planning phase above.

As the development proceeds through the development milestones,
the software reliability goals that were established during the
initiation phase should be evaluated and technical management
will want to determine that the milestones have been attained.
This may involve direct measurement of software reliability or,
particularly at the early milestones, evaluation of predictors of
software reliability. At this stage the establishment of
objective and accessible measurement criteria is essential.

If it is determined that milestone objectives have not been
attained, a recovery plan must be prepared. Typically, this
involves corrective actions modifying the software system
architecture, the design, or the code.

Software Development Management is interpreted here as those
organizational activities in a project that are directly charged
with oversight of the software development, test, and
integration. The higher level managers of the software activities
within the developing organization are expected to have similar
objectives, particularly where software development is
subcontracted and must be managed as a separate activity.

In the context described above, software management has received
operational reliability goals and requirements to be met at
specified milestones during the development which were generated
as outlined in the preceding paragraphs. These goals must be
allocated to individual software segments, and it is also
generally desired to establish more detailed evaluation criteria
so that the probability of attaining the milestone requirements
can be gauged during the development process. From these
responsibilities arise objectives for software reliability forecasting
at a much more detailed level than found in the prior discussion.
At the same time, software management has access to much more
specific information about the structure, content, and
development environment of the product.

Where the attainment of milestones or of the ultimate reliability
goals appears in doubt, means of gauging the effects of several
alternatives for reliability improvement are desired. Candidate
alternatives may involve a new design for the program or for the
data structure, improved test techniques, or the adoption of
software fault containment or fault tolerance techniques. These
types of software engineering decisions will be driven by the
reliability predictors. The reliability prediction and
estimation techniques should support an objective and accurate
evaluation of the effects of these alternatives. During this phase,
the forecasting techniques are used to evaluate progress and
assist in the reliability engineering. A quality assurance or
reliability engineering group within the developer's organization
or an IV&V contractor would most likely be involved in taking
these detailed measures. The software development team within
the developer's organization would use measures to make software
engineering decisions.


2.2.3  Utility  During  Test  Phases  and  Acceptance 


The observed system reliability during the various phases of
testing and eventually during acceptance testing can be the basis
for acceptance/rejection of the system. If a goal is
contractually stated and the acceptance test procedure specifically
identifies that goal as an acceptance/rejection criterion, then
use of this technique can have significant importance to the
developer. The developer is involved in performing system
testing. An independent Test and Evaluation organization or an
IV&V contractor may be involved in conducting independent tests
to assess reliability. The SPO and SAM are involved in accepting
the system. The Test Agent is involved in operational testing
phases.


2.2.4  Utility During Transition To Operational Use (Deployment)
and Operations and Maintenance


Although the planning and initiation activities had generated a
time phased series of milestones that should lead to the desired
software reliability in operational use, there usually arise a
considerable number of questions about software reliability as
the date for cut-in approaches. The goals established during
planning were of necessity quite general and may no longer be
applicable to the structure of the system and software as they
are being delivered. It is quite typical to observe during the
cut-in period many failures associated with the software that are
not truly software failures but are the result of procedural
mistakes or of inconsistencies between the specified and the
actual environment. The objectives of software reliability at
this point relate primarily to reporting and measurement
procedures, with emphasis on distinguishing between events where the
software failed to meet its specification (the frequency of these
can be interpreted as indicative of operational reliability) and
events that are primarily due to the transition process and which
are therefore not expected to persist during steady state
operation. The life-cycle agent and end user are involved in
this process.


After a system has become operational, a software reliability
goal is to exhibit a pattern of continued decrease of failure
frequency and, concomitant with this, to identify and prevent
causes of increasing failure frequency. The utility of the
reliability measurements is the ability to assess the
reliability actually achieved within the system. Typical causes of poor
reliability include inadequate software maintenance, instability
of the hardware or software configuration, and lack of
communication regarding changes in user requirements or expectations. The
emphasis is on measurements that are efficient in identifying
changes in trends. Again the end user and life-cycle agent play
key roles in maintaining and improving the reliability
performance of the system.


2.3  SOFTWARE RELIABILITY ENGINEERING MANAGEMENT

Figure 2-5 identifies many of the activities cited in the above
paragraphs according to detailed life-cycle phases. The
availability of specific measurements and predictive and estimation
techniques will facilitate the performance of these activities
during software developments. These activities represent a
Software Reliability discipline that should be incorporated in
software development. This discipline has aspects that are
management-related, development-related, quality
assurance-related, and test-related.

Figure 2-6 highlights the types of questions that the reliability
measurement techniques will help answer.


3.0  CANDIDATE RELIABILITY MEASUREMENTS


The Software Reliability Measurement Framework, illustrated in
Figure 2-1 in Section 2, identified two measurement objectives
that were the focus of this research effort. They are a
Predictive Software Reliability Figure-of-Merit (RP) and a Reliability
Estimation Number (RE). The predictive RP is derived from
measurements taken in the early life cycle phases of a
development, when a prediction of the reliability of the software
can be made based on the characteristics of the evolving
software system. The RE is an estimation of the reliability based
on the observed failure rate of the software during the test
phases of the development. This section describes the candidate
measurements which were identified for each of those numbers.
Also described in this section are the relationship of these
candidate metrics to the RADC Software Quality Measurement
Framework, when during the life-cycle these candidate measurements
apply, and Data Collection Procedures for calculating the
metrics. Section 4 of this report describes the data collected
to calculate these metrics. Section 5 describes the process and
results of the validation efforts with these metrics.

3.1  SOFTWARE  QUALITY  MEASUREMENT  FRAMEWORK 

A Software Quality Measurement Framework was established in
Factors in Software Quality, RADC-TR-77-369. That framework had
a basic structure illustrated in Figure 3-1. From that initial
report, four quality factors are identified that relate to and
impact software and system reliability:

Software Reliability: The extent to which a program can be
expected to perform its intended function with required
precision.

Software Correctness: The extent to which a program satisfies
its specifications and fulfills the user's mission
objectives.

Software Maintainability: The effort required to fix an
error in an operational program.

Software Testability: The effort required to verify the
specified software operation and performance.

A more recent report, Specification of Software Quality
Attributes, RADC-TR-83-37, expands these factors to the following:

Reliability: Extent to which the software will perform
without any failures within a specified time period.

Survivability: Extent to which software will perform and
support critical functions without failures within a
specified time period when a portion of the system is inoperable.

Correctness: Extent to which the software conforms to its
specifications and requirements.

Maintainability: Ease of effort for locating and fixing a
software failure within a specified time period.

Verifiability: Relative effort to verify the specified
software operation and performance.

Table 3-1 illustrates the criteria and metrics related to these
factors. Each of these metrics was considered in arriving at
the candidate measurements for the RP and RE. Also considered
specifically for applicability to the RE were the reliability
models mentioned in Section 1 and described in [GOEL83].


3.2  A  SOFTWARE  RELIABILITY  MEASUREMENT  MODEL 

The framework presented in Section 2 represents a life-cycle view
of software reliability measurement. The heart of the framework
is the ability during the development phases to predict and
estimate software reliability. These predictions and estimations
are comparable to the specified reliability requirements and
eventually to the observed operational reliability.

3.2.1  A  Model  Of  The  Software  Failure  Process 

In  order  to  identify  the  software  measurements  to  be  used  to 
predict  and  estimate  software  reliability  we  need  to  understand 
how  software  fails  (i.e.,  what  we  are  predicting  and  estimating) 
and  how  we  can  organize  the  candidate  measures  according  to  their 
value  as  predictive  or  estimation  metrics. 

Software does not fail in the sense of a permanent physical state
change such as is usually associated with hardware failures.
Nevertheless, it has become customary to refer to software
failures as a shorthand term for failures in the computing
process which are caused by the software. A graphical
representation of that failure process is shown in Figure 3-2. In the
strictest sense, the failure is an event that causes a binary bit
pattern inside the computer to take a wrong value, shown inside
the larger box in the figure.

Typically, this event is not actually observed, but the evidence
that a failure has occurred is found in an incorrect value at the
output of the computer, i.e., an error (as defined in Appendix
A). Not every error is observed, and since the reliability
values produced by the prediction and estimation techniques
should agree with those eventually observed, the predictions and
estimations must be adjusted for the degree to which errors are
expected to be observed. The observation takes place in the
operating environment, and the methodology for accounting for
observation in the estimation is part of an environment factor.

TABLE 3-1. CANDIDATE METRICS FROM SOFTWARE QUALITY
MEASUREMENT FRAMEWORK

FACTOR  CRITERION             ACRONYM  METRIC
        ACCURACY              AC.1     ACCURACY CHECKLIST
        ANOMALY MANAGEMENT    AM.1     ERROR TOLERANCE/CONTROL
                              AM.2     IMPROPER INPUT DATA
                              AM.3     COMPUTATIONAL FAILURES
                              AM.4     HARDWARE FAULTS
                              AM.5     DEVICE ERRORS
                              AM.6     COMMUNICATIONS ERRORS
                              AM.7     NODE/COMMUNICATIONS FAILURES
        SIMPLICITY            SI.1     DESIGN STRUCTURE
                              SI.2     STRUCTURED LANGUAGE OR PREPROCESSOR
                              SI.3     DATA AND CONTROL FLOW COMPLEXITY
                              SI.4     CODING SIMPLICITY
                              SI.5     SPECIFICITY
                              SI.6     HALSTEAD'S LEVEL OF DIFFICULTY
        AUTONOMY              AU.1     INTERFACE COMPLEXITY
                              AU.2     SELF SUFFICIENCY
S       DISTRIBUTEDNESS       DI.1     DESIGN STRUCTURE
S,M,V   MODULARITY            MO.1     MODULAR IMPLEMENTATION
                              MO.2     MODULAR DESIGN
S       RECONFIGURABILITY     RE.1     RESTRUCTURE
        COMPLETENESS          CP.1     COMPLETENESS CHECKLIST
C,M     CONSISTENCY           CS.1     PROCEDURE CONSISTENCY
                              CS.2     DATA CONSISTENCY
C       TRACEABILITY          TC.1     CROSS REFERENCE
        SELF DESCRIPTIVENESS  SD.1     QUANTITY OF COMMENTS
                              SD.2     EFFECTIVENESS OF COMMENTS
                              SD.3     DESCRIPTIVENESS OF LANGUAGE
        VISIBILITY            VS.1     UNIT TESTING
                              VS.2     INTEGRATION TESTING
                              VS.3     CSCI TESTING

R - RELIABILITY      S - SURVIVABILITY    C - CORRECTNESS
M - MAINTAINABILITY  V - VERIFIABILITY

FIGURE 3-2. BASIC SOFTWARE FAILURE MODEL

Some faults in the code will produce an error during every
execution. These are normally corrected very early during
checkout by the developer even before the program enters formal
testing. Failures that are of concern in software reliability
measurement for Air Force projects usually come about when a rare
external event (data set or computer state) causes the execution
of the code to differ in some way from the routine manner. A
software fault that had previously been present, but not resulted
in an error, has thereby been revealed. Both the presence of
faults in the code and the occurrence of triggering events will,
therefore, affect software reliability.

3.2.2  Organization  Of  Software  Reliability  Measurements 

Two broad classes of software reliability metrics have been
addressed in the literature, based, respectively, on fault
content of the code and on the number of failures encountered
during service. The common normalized forms of these are fault
density and failure rate. Because the latter measure can be
combined with conventional hardware reliability metrics to yield
a single expression for computer system reliability, it is being
given preference. However, there are some situations in which
fault density is either the only measure available or is a more
convenient expression to use. Therefore, it is also covered in
the following discussion.

3.2.2.1  Fault Density

The software user wishes to procure fault-free code, and the
software developer has economic incentives to want to meet the
user's requirements. It is recognized that completely fault-free
code for a large project is not within the present capabilities,
and thus a measure for relative freedom from faults is required.
Fault density has been found a useful and meaningful metric. One
of the first to provide quantitative data on fault density was
F. Akiyama [AKIY71]. He reported an average fault density of 1%
in programs entering formal test, and this number has been
repeatedly confirmed in other publications. Modern programming
techniques have produced some improvement, and a declining trend
has been noted. For recent HOL programs, an order of magnitude
improvement, 0.1%, appears to be representative [HECH83].

Fault density can be expressed as the number of faults found in
total lines of code or in executable lines of code, and a
distinction must be made between these. The measure used in this
report is based on executable lines. It is also important to
recognize that a single line of HOL code usually replaces 2 to 8
lines of assembly language code, depending on the higher-order
language.
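
The fault density definition and the HOL-to-assembly relationship
can be illustrated as follows. This is a minimal sketch: the
function names and the sample counts are hypothetical, and the
conversion assumes that the same faults are simply spread over
the larger number of assembly-equivalent lines.

```python
def fault_density(faults_found, executable_lines):
    """Faults per executable line of code."""
    if executable_lines <= 0:
        raise ValueError("executable_lines must be positive")
    return faults_found / executable_lines


def assembly_equivalent_density(hol_density, expansion_ratio):
    """Restate an HOL fault density per assembly-equivalent line.

    expansion_ratio is the number of assembly lines replaced by one
    HOL line (the text above cites a range of 2 to 8).
    """
    if expansion_ratio <= 0:
        raise ValueError("expansion_ratio must be positive")
    return hol_density / expansion_ratio


# Hypothetical HOL program: 12 faults in 12,000 executable lines.
hol_density = fault_density(faults_found=12, executable_lines=12_000)

# Assumed expansion ratio of 4 assembly lines per HOL line.
asm_density = assembly_equivalent_density(hol_density, expansion_ratio=4)
```

With these sample counts the HOL density is 0.001, i.e., the 0.1%
level quoted above for recent HOL programs.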

Fault density has the following advantages as a reliability
metric:

•  It appears to be a fairly invariant number.

•  It can be obtained from commonly available data.

•  It is not directly affected by variables in the environment
(but testing in a stressful environment may produce a
higher value than testing in a passive environment).

•  Conversion among fault density metrics is fairly
straightforward (see above).

•  The metric facilitates combination of faults found by
inspection with those found during execution since the
time element of the latter is not accounted for.

The major disadvantages are:

•  It cannot be combined with hardware reliability metrics.

•  It does not relate to observations in the user environment.

•  There is no assurance that all faults have been found.

3.2.2.2  Failure Rate

The incidence of software failures (as distinct from the presence
of faults in the code) is viewed as an undesirable characteristic
by the user. The frequency of failures in a specified time
interval is, therefore, a measure of unreliability as seen by the
user, or, conversely, the time between failures is a measure of
reliability. Metrics of this type based on elapsed time (also
referred to as wall clock time) are not meaningful for assessment
of the inherent reliability of the software product because they
are not directly related to the exposure to failure. Thus, for a
computer that is not in use during weekends it will be found that
the software failure rate (in wall clock time) during that period
is a very satisfactory zero. Unfortunately, during the week when
it is in use, it has a finite value. This has given rise to some
very erroneous assessments of software reliability because the
elapsed time failure rate tends to increase during periods of
heavy test activity simply because more usage hours are being
logged per calendar day. The increasing trend causes concern,
reflected in yet higher test activity and higher apparent failure
rates.

To avoid these inconsistencies, failure rates based on execution
time have been proposed, and their use has led to much more
satisfactory results [MUSA75, HECH77]. Failure rates based on
execution time or an alternative, computer operation time, will
be used throughout this project. Execution time is the interval
during which the central processing unit (CPU) of the computer
executes the program. It is only during execution of the program
that failures will be encountered. The ratio of execution time
to wall clock time may, therefore, be thought of as the duty
cycle of the software.
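
The difference between the two time bases can be seen in a small
worked example. The sketch below uses hypothetical weekly test
logs and illustrative function names. The second week has twice
the test activity of the first, so the wall clock failure rate
doubles even though the failure rate per execution hour is
unchanged.

```python
def failure_rate(failures, hours):
    """Failures per hour over the given exposure time."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return failures / hours


def duty_cycle(execution_hours, wall_clock_hours):
    """Ratio of execution time to wall clock time."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    return execution_hours / wall_clock_hours


# Hypothetical log: one calendar week is 168 wall clock hours.
weeks = [
    {"failures": 4, "execution_hours": 20.0, "wall_clock_hours": 168.0},
    {"failures": 8, "execution_hours": 40.0, "wall_clock_hours": 168.0},
]

summary = []
for week in weeks:
    summary.append({
        "duty_cycle": duty_cycle(week["execution_hours"],
                                 week["wall_clock_hours"]),
        "rate_wall_clock": failure_rate(week["failures"],
                                        week["wall_clock_hours"]),
        "rate_execution": failure_rate(week["failures"],
                                       week["execution_hours"]),
    })
```

Comparing the two entries of `summary` shows the apparent
degradation in wall clock terms and the constant rate in
execution time terms.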

On  most  mainframes,  the  operating  system  reports  the  execution 
time  for  each  program  or  project  on  a  run  basis  and  also  computes 
daily,  weekly,  or  monthly  totals.  Where  these  reports  are  not 
available,  execution  time  may  be  expressed  in  computer  operation 
time, the time during which the computer (as contrasted with the
CPU)  executes  the  program.  Computer  operation  time  exceeds  CPU 
time  (in  the  range  of  two  to  ten  times  CPU  time)  because  it  also 
includes  time  for  mass  storage  access,  output  functions,  etc. 
Proper  methods  of  converting  computer  time  to  CPU  time  or 
equivalent  acceptable  measures  are  discussed  later  in  this 
section. 

Failure  rate  measurements  based  on  execution  time  have  the 
following  advantages: 

•  Observable  and  meaningful  in  the  operating  environment. 

•  Can be computed over any time interval limited only by
statistical averaging considerations.

•  Can  with  proper  procedures  be  combined  with  hardware 
failure  rate  to  yield  a  computer  system  failure  rate. 

They  have  the  following  disadvantages: 

•  Affected by conditions in the environment.

•  Do not include faults found by inspection.

•  Require  measurement  or  estimation  of  execution  time. 

It is intuitive that fault density is a self-normalizing
metric, i.e., it measures a characteristic of the code that is
not directly affected by the length of the program. The
execution-time-based failure rate is self-normalizing in the same
manner because a long program will have a longer running time
than a short one.

3.2.2.2.1  Execution Ratio

There are some environments in which it is possible to obtain the
computer time but not the execution time, e.g., avionics
computers and militarized microcomputers. Failure rate measurements
based on computer time can also be used for monitoring the
relative progress of a given software package in the same manner
as the failure ratio discussed in the subsequent paragraph.
These failure rate measurements can also be used for comparisons
between modules as long as all run on the same computer type.
Failure rate estimation based on computer time can be implemented
However, there will be many instances in which it is desirable to
convert computer time to execution time, particularly in the
utilization of software reliability prediction. A number of
methods can be used for this conversion:

•  Running a benchmark HOL program on a mainframe on which
execution time will be reported, and then running the same
test case on the target computer.

•  Running a program on the target computer in a manner that
will eliminate or minimize disk access (e.g., by putting
data in memory) and output operations, thus obtaining
essentially an execution time measurement, and then
running the same test case in the normal manner.

•  Counting the number of I/O operations involved in a
program and computing the nominal time for these from the
computer instruction manual.

•  Benchmarking a program with timers and counters during
IOT&E (operational environment).
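
Whichever of the methods above is used, the result is a ratio of
execution time to computer time that can then be applied to
failure data logged in computer time. The following sketch shows
one way this might be done. It assumes the ratio measured on the
benchmark is representative of normal operation; the function
names and figures are hypothetical.

```python
def execution_ratio(execution_time, computer_time):
    """Fraction of computer operation time spent executing on the CPU.

    Both arguments come from the same benchmark run and must use the
    same time unit.
    """
    if not 0 < execution_time <= computer_time:
        raise ValueError("expected 0 < execution_time <= computer_time")
    return execution_time / computer_time


def execution_time_failure_rate(failures, computer_hours, ratio):
    """Convert failures logged against computer time to a rate per
    execution hour, using a previously measured execution ratio."""
    if computer_hours <= 0:
        raise ValueError("computer_hours must be positive")
    return failures / (computer_hours * ratio)


# Hypothetical benchmark: 30 s of CPU time in a 120 s run, so computer
# time is 4 times CPU time (within the 2 to 10 range cited earlier).
ratio = execution_ratio(execution_time=30.0, computer_time=120.0)

# Hypothetical test period: 6 failures in 200 computer operation hours.
rate = execution_time_failure_rate(failures=6, computer_hours=200.0,
                                   ratio=ratio)
```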

Depending on the purpose for which the software reliability
measurement is to be used, it may be necessary to modify the
direct execution time based metric that was introduced in the
preceding paragraph. Execution time can be dispensed with
entirely when reliability measurements are being carried out to
track the progress of a given software package during a test or
modification program. Since only a measure of relative
improvement is desired, and since the execution time of the program will
be reasonably constant, the failure ratio rather than failure
rate can be used. The failure ratio is computed by dividing the
number of runs that failed by the number of successful runs
during a specified time interval, e.g., one week or one month.
This method can be used as a primitive form of software
reliability estimation (the failure ratio rather than the failure rate is
being estimated). The advantage of this variant is that it can
be implemented in practically any computing environment whereas
execution time based measurements require an operating system
that logs execution time. The major disadvantage is that the
failure ratio cannot be used for comparison among programs of
different size or running on different computers because it is
not self-normalizing.
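
The failure ratio defined above lends itself to a very simple
calculation. The sketch below follows the definition in the text
(failed runs divided by successful runs per interval). The weekly
run counts and the names are hypothetical.

```python
def failure_ratio(failed_runs, successful_runs):
    """Failed runs divided by successful runs in one interval."""
    if successful_runs <= 0:
        raise ValueError("successful_runs must be positive")
    return failed_runs / successful_runs


# Hypothetical weekly tallies of (failed runs, successful runs)
# for one software package under test.
weekly_runs = [(12, 88), (9, 91), (5, 95)]

ratios = [failure_ratio(failed, ok) for failed, ok in weekly_runs]

# Relative progress: is each week's ratio lower than the week before?
improving = all(later < earlier
                for earlier, later in zip(ratios, ratios[1:]))
```

Because no time measurement enters the calculation, the ratios
are only comparable from week to week for the same package on the
same computer, as noted above.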

3.2.2.2.2  Failures Per Execution

The failure rate based on execution time is a meaningful number
that can be used for global comparisons if applied to computers
of a given class, e.g., 32-bit machines in the 6 MIPS range
(million instructions per second). The failure rate is not
suitable for comparisons among computers of different word
formats or performance classes. It is misleading to compare the
failure rate on a 16-bit avionics computer that executes at 2
MIPS with that of a 60-bit mainframe executing at 20 MIPS. The
latter machine processes approximately 40 times as much
information in a given time interval, and if the identical test cases
were run on it (only theoretically possible) the observed failure
rate would have been 40 times that on the avionics computer.

For global comparisons involving computers that differ
significantly in performance, it is necessary to divide the
execution time based failure rate by the number of bits executed
per second on each of the computers. A 16-bit computer operating
at 2 MIPS executes 32 megabits per second, and the 60-bit
computer operating at 20 MIPS executes 1200 megabits per second.
These factors transform the time-based failure rate into a
failure rate based on information processed, i.e., failures per
execution. The latter usually has little meaning in an
operational environment and should be used only for research or
global comparisons. Another form of this same type of
measurement is failures per instructions processed.
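The normalization can be sketched in Python as follows. The word sizes and MIPS ratings are the ones used in the text; the failure rates are hypothetical values chosen only to show that a 37.5-fold difference in time-based failure rate disappears once each rate is divided by the bits executed per second.

```python
def bits_per_second(word_bits: int, mips: float) -> float:
    """Bits executed per second for a machine of given word size and speed."""
    return word_bits * mips * 1e6


def failures_per_bit(failure_rate_per_second: float,
                     word_bits: int, mips: float) -> float:
    """Convert a time-based failure rate into failures per bit processed."""
    return failure_rate_per_second / bits_per_second(word_bits, mips)


avionics_bps = bits_per_second(16, 2.0)     # 32 megabits per second
mainframe_bps = bits_per_second(60, 20.0)   # 1200 megabits per second
ratio = mainframe_bps / avionics_bps        # the "approximately 40" of the text

# Hypothetical time-based failure rates (failures per second of execution).
avionics_rate = 0.002
mainframe_rate = avionics_rate * ratio

avionics_norm = failures_per_bit(avionics_rate, 16, 2.0)
mainframe_norm = failures_per_bit(mainframe_rate, 60, 20.0)
print(ratio, avionics_norm, mainframe_norm)
```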




Thus many basic units of measurement for reliability have been
considered, including fault density, failure rate (both execution
time and computer time based), failure ratio, and failures per
execution (information processed or instructions processed).
Further discussions of alternative failure rate reliability
measures can be found in [THIB84].

3.2.2.3 A Proposed Structure

Our choice as a principal unit of measure for expressing software
reliability is the failure rate. However, early in the
development phases, the available data is more applicable to
predicting a fault density. Our approach is to predict a fault
density based on measurements taken early in the development
phase, develop a transformation function to interpret that fault
density as a predicted failure rate, and then during the later
phases of development (testing) use an estimation based on
failure rate. A basic measurement model is illustrated in Figure
3-3, where we recognize that software fails because it has faults
(fault density represents the number of faults in the software
based on its quality) and because of the environment in which it
will be used (trigger rate represents the variability of inputs,
the severity of the operational environment, etc.). The
transformation function between fault density and failure rate
was developed through empirical analyses and is presented in
Section 3.

3.3 RELATIONSHIP OF CANDIDATE METRICS TO STRUCTURE

With this view of software reliability, the candidate
measurements (metrics) discussed earlier in this section and new
measurements identified during this research effort can be
organized as follows.


FIGURE 3-3. MEASUREMENT STRUCTURE
(Diagram not reproduced. Labels recovered from the figure:
failure rate; one input is a function of Development Environment
and Software Characteristics; the other is a function of Test
Environment, Test Thoroughness, and Operational Environment.)

Those measurements which can be applied early in the development
and represent an assessment of the quality of the software can be
related to a measure of fault density and eventually transformed
to a predictive failure rate.

Those measurements which are applied late in the development and
represent an assessment of the performance of the software during
testing can be related to the trigger rate.

Table 3-2 illustrates the allocation of candidate measurements to
a predictive reliability number and a reliability estimation
number. The measurements shown are described in the following
paragraphs. Data collection procedures for each metric are in
Appendix B to Volume II of this report.

In order to maintain consistent terminology, the following
conventions will be followed:

• The Predictive Reliability Figure-of-Merit (Rp) and the
Reliability Estimation Number (RE) will be called
reliability numbers.

• Metrics or measures are derived values which when
multiplied together will calculate one of the reliability
numbers. A metric can be a simple metric (e.g., D,
Development Environment) or a composite metric (e.g., S,
Software Characteristics) which is the product of more
than one simple metric.

• Data items are specific data elements which must be
collected or measured in order to derive a metric. The
data items associated with each metric are described in
the Data Collection Procedures and worksheets in
Appendices B and C to Volume II.

In all cases, metric values were derived from data collection and
statistical analyses performed on past projects or during later
phases of this research project.

3.3.1 Predictive Metrics

In the past, software quality metrics have not met with wide
acceptance because there are a large number of them, they are
expensive to collect (manually), and they have not all been
validated. In order to avoid these problems the following
approach was adopted on this study:

• The software quality metrics (see Table 3-1) were reviewed
to determine which metrics were predictive in nature.
Many of the metrics currently defined in the Software
Quality Measurement Framework are in effect standards,
i.e., if the metric or metric worksheet item has a low
score it should be corrected. These metrics are used in
just that way by practitioners, as QA or IV&V checklists,
to report problems.

• The metrics which were considered predictive were
retained.

• The metrics which were considered to be QA/IV&V checklist
candidates are advocated as review checklists to be used
during formal reviews such as design reviews and informal
reviews such as walkthroughs.

• The number of problem reports generated as a result of
applying these checklists is a metric to be used.

TABLE 3-2. PREDICTIVE AND ESTIMATION METRICS

PREDICTIVE METRICS

  APPLICATION TYPE                              A
  DEVELOPMENT ENVIRONMENT                       D
  SOFTWARE CHARACTERISTICS                      S
    REQUIREMENTS AND DESIGN REPRESENTATION      S1
      ANOMALY MANAGEMENT                        SA
      TRACEABILITY                              ST
      QUALITY REVIEW RESULTS                    SQ
    SOFTWARE IMPLEMENTATION                     S2
      LANGUAGE TYPE                             SL
      PROGRAM SIZE                              SS
      MODULARITY                                SM
      EXTENT OF REUSE                           SU
      COMPLEXITY                                SX
      STANDARDS REVIEW RESULTS                  SR

  Rp = A · D · S, WHERE
    S  = S1 · S2
    S1 = SA · ST · SQ
    S2 = SL · SS · SM · SU · SX · SR

ESTIMATION METRICS

  FAILURE RATE DURING TESTING                   F
  TEST ENVIRONMENT                              T
    TEST EFFORT                                 TE
    TEST METHODOLOGY                            TM
    TEST COVERAGE                               TC
  OPERATING ENVIRONMENT                         E
    WORKLOAD                                    EW
    INPUT VARIABILITY                           EV

  RE = F · T, DURING TESTING, WHERE
    T = TE · TM · TC, and
  RE = F · E, DURING OT&E, WHERE
    E = EW · EV

Several new metrics were also identified and are discussed in the
following paragraphs. The Predictive Reliability Figure-of-Merit
(Rp) is the product of the identified metrics. The individual
metrics were adjusted during validation to a numeric that can be
used as a multiplier in this product. The final results are
presented in Volume II. The validation process is described in
Section 9 of this Volume.
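The product structure of Table 3-2 can be sketched in Python as follows. The function names and every numeric value are hypothetical placeholders; the validated multipliers are those presented in Volume II and are not reproduced here. Only the relations Rp = A · D · S, S = S1 · S2, RE = F · T, and RE = F · E are taken from the table.

```python
from math import prod
from typing import Mapping

S1_KEYS = {"SA", "ST", "SQ"}
S2_KEYS = {"SL", "SS", "SM", "SU", "SX", "SR"}


def composite(metrics: Mapping[str, float], expected: set) -> float:
    """Multiply the simple metrics that make up one composite metric."""
    if set(metrics) != expected:
        raise ValueError(f"expected {sorted(expected)}, got {sorted(metrics)}")
    return prod(metrics.values())


def predictive_reliability(a: float, d: float,
                           s1: Mapping[str, float],
                           s2: Mapping[str, float]) -> float:
    """Rp = A * D * S, where S = S1 * S2."""
    s = composite(s1, S1_KEYS) * composite(s2, S2_KEYS)
    return a * d * s


def estimation_during_testing(f: float, te: float, tm: float, tc: float) -> float:
    """RE = F * T during testing, where T = TE * TM * TC."""
    return f * (te * tm * tc)


def estimation_during_ote(f: float, ew: float, ev: float) -> float:
    """RE = F * E during OT&E, where E = EW * EV."""
    return f * (ew * ev)


# Hypothetical inputs: a nominal fault density of 0.01 for the application
# type, a favorable development environment (0.5), and neutral (1.0)
# software characteristics.
neutral_s1 = dict.fromkeys(S1_KEYS, 1.0)
neutral_s2 = dict.fromkeys(S2_KEYS, 1.0)
rp = predictive_reliability(a=0.01, d=0.5, s1=neutral_s1, s2=neutral_s2)
print(rp)
```

Because every metric enters as a multiplier, a value of 1.0 leaves the nominal prediction unchanged, values below 1.0 improve it, and values above 1.0 degrade it.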

3.3.1.1 Application Type (A)

The type of application, i.e., the function to be performed, is
considered a basic characteristic of the software. It is
considered in this study as the basis for establishing a nominal
prediction number. The type of application typically affects
both the manner in which software is developed and how it is
operated. Because of those effects, the application type is not
independent of the other metrics to be discussed. However, since
it is perhaps the first characteristic known about the software,
it is a valuable initial predictor. Our concept is to use a
classification scheme for the application type. A fault density
(or failure rate) will be associated with each category or
application type. We will develop that metric by looking at a
wide range of systems and taking the average for those that fall
within each application type. The metric will be a fault density
associated with the application type chosen, A.
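The averaging step can be sketched as follows. The project records are hypothetical, and the sketch assumes that the average is taken over the per-system fault densities (faults per executable line) rather than over pooled fault and line counts; the text does not say which was used.

```python
from collections import defaultdict
from statistics import mean


def nominal_fault_density(projects):
    """Average fault density per application type.

    `projects` is an iterable of (application_type, faults, executable_lines).
    """
    densities = defaultdict(list)
    for app_type, faults, executable_lines in projects:
        if executable_lines <= 0:
            raise ValueError("executable_lines must be positive")
        densities[app_type].append(faults / executable_lines)
    return {app_type: mean(values) for app_type, values in densities.items()}


# Hypothetical project records, not taken from the study data base.
projects = [
    ("airborne", 120, 10_000),
    ("airborne", 80, 10_000),
    ("tactical", 50, 10_000),
]
nominal = nominal_fault_density(projects)
print(nominal)
```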

Several potential classification schemes were identified. They
are presented in Table 3-3. For the sake of this study, we
decided to evaluate two of these approaches. Hecht's basic
categorization was real-time, interactive, batch processing, and
support. He further distinguishes each of these categories
depending on access. In [MCCA77], an application scheme that was
Air Force application-related was developed. This scheme was
developed to be oriented toward the AF SAM or SPO. The RCA
PRICE-S model uses the classification scheme in column three for
the parameter PLATFORM, recognizing the influence of Military
Standards on a system. The PRICE-S model also uses an
application mix for the software. The categorization scheme for
this mix plus the relative numerics used in the PRICE-S system
are shown in Table 3-4. The RADC Test Handbook [PRES84] uses the
classification scheme in column four. This categorization
relates specifically to the functions being performed by the
software. From a system perspective, there are typically a
number of these functions being performed within a system. The
two approaches chosen for evaluation were the first two. Each
was modified as shown in Table 3-5.

TABLE 3-3 (column headings not recovered; the columns are listed
in order, with sources as identified in the text above)

COLUMN 1 (Hecht)

• REAL TIME OPERATING SYSTEM
• REAL TIME CLOSED LOOP OPERATING SYSTEM
• OTHER REAL TIME
• INTERACTIVE OPERATING SYSTEM
• INTERACTIVE APPLICATION - PUBLIC
• INTERACTIVE APPLICATION - RESTRICTED
• SCIENTIFIC BATCH
• OTHER BATCH
• SUPPORT PROGRAM
• HARDWARE DIAGNOSTIC
• SOFTWARE TOOLS AND DIAGNOSTICS

COLUMN 2 ([MCCA77])

• MANNED SPACECRAFT, AIRBORNE AVIONICS
• UNMANNED SPACECRAFT, MISSILES
• INDICATION AND WARNING
• SENSOR DATA PROCESSING/INTELLIGENCE
• STRATEGIC/TACTICAL C2
• COMMUNICATIONS
• DEVELOPMENT, TEST BED

COLUMN 3 (PRICE-S PLATFORM)

• MANNED SPACECRAFT
• UNMANNED SPACECRAFT
• MIL-SPEC AVIONICS
• COMMERCIAL AVIONICS
• MOBILE SYSTEM
• NON REAL TIME C2
• MIL-SPEC GROUND SYSTEM
• SATELLITE GROUND SYSTEM
• PRODUCTION CENTER SOFTWARE - CONTRACTOR DEVELOPED
• PRODUCTION CENTER SOFTWARE - USER DEVELOPED

COLUMN 4 (RADC Test Handbook [PRES84])

• EVENT CONTROL
• PROCESS CONTROL
• PROCEDURE CONTROL
• NAVIGATION
• FLIGHT DYNAMICS
• ORBITAL DYNAMICS
• MESSAGE PROCESSING
• DIAGNOSTIC SOFTWARE
• SENSOR & SIGNAL PROCESSING
• SIMULATION
• DATA ACQUISITION
• DATA PRESENTATION
• DECISIONS & PLANNING AIDS
• PATTERN & IMAGE PROCESSING
• COMPUTER SYSTEM SOFTWARE
• SOFTWARE DEVELOPMENT TOOLS

The Air Force application scheme has six major categories:
airborne, strategic, tactical, process control, production
center, and developmental/support. Airborne applications are
systems which perform real-time closed-loop functions such as
navigation, flight control, fire control, and electronic warfare
on-board an aircraft. Systems on-board a satellite performing
orbital control, data acquisition, and power supply control would
also be considered airborne systems. Strategic applications are
systems involved in planning, directing, or providing warning of
large-scale military operations. An industry equivalent
application would be a company-wide communication system
supporting business management, decision support, and operation.
Indication and warning systems like a ballistic missile defense
system are considered a strategic application. Tactical
applications are systems involved in support of actual enemy
engagements, providing such functions as weapon system fire
control, short range communications, and combat decision support.
Process Control applications are systems involved in monitoring
and controlling machinery such as numerical control manufacturing
equipment and nuclear power plants. The production center
application category involves Management Information Systems such
as personnel, finance, payroll, and inventory control that
typically run in a computer center environment, primarily in
batch mode. More modern examples of these types of systems are
on-line interactive transaction processing systems. The
Developmental Support applications category includes those
systems which support the development of systems (e.g., software
engineering environments), simulations, testbeds, and analytical
packages. Examples of systems which would fall in such categories
are shown in Table 3-5. These examples serve as definitions of
the categories.

The time dependence scheme has four basic categories: real-time,
on-line interactive or transaction processing, batch, and support
software. We considered subcategorizing real-time into
closed-loop (e.g., flight control) and other, and on-line into
distributed and centralized, to evaluate the differences of those
subcategories, but postponed that for future research.

Table 3-5A identifies a categorization scheme based on software
function [PRES84] that is recommended for future research. This
more detailed categorization scheme would provide a nominal
(baseline) reliability at a subsystem or CPC level.

Where more detailed information is available, we could further
categorize the application by the set of software functions
being performed and the time dependency of these functions. We
anticipate that we will eventually, based on observed data,
modify several of the categories.

TABLE 3-5. APPLICATION CLASSIFICATION SCHEMES

APPLICATION

• AIRBORNE SYSTEMS
  - MANNED SPACECRAFT
  - UNMANNED SPACECRAFT
  - MIL-SPEC AVIONICS
  - COMMERCIAL AVIONICS

• STRATEGIC SYSTEMS
  - STRATEGIC C2
  - INDICATIONS AND WARNING
  - COMMUNICATIONS

• TACTICAL SYSTEMS
  - TACTICAL C2
  - TACTICAL MIS
  - MOBILE
  - EW/ECCM

• PROCESS CONTROL SYSTEMS
  - INDUSTRIAL PROCESS CONTROL

• PRODUCTION SYSTEMS
  - MIS
  - DECISION AIDS
  - INVENTORY CONTROL
  - SCIENTIFIC

• DEVELOPMENTAL SYSTEMS
  - SOFTWARE DEVELOPMENT TOOLS
  - SIMULATION
  - TESTBEDS
  - TRAINING

TIME DEPENDENCE

• REAL-TIME
• ON-LINE (INTERACTIVE/TRANSACTION PROCESSING)
• NON-TIME CRITICAL (BATCH)
• SUPPORT

TABLE 3-5A. APPLICATION CLASSIFICATION SCHEMES

FUNCTION

• EVENT CONTROL
• PROCESS CONTROL
• MESSAGE PROCESSING
• SENSOR AND SIGNAL PROCESSING
• PATTERN AND IMAGE PROCESSING
• DISTRIBUTION/COMMUNICATION
• DISPLAY/DATA PRESENTATION
• PROCEDURE CONTROL
• RESOURCE MANAGEMENT/CONTROL
• SCIENTIFIC/ANALYTICAL PROCESSING
• DECISION AND PLANNING AIDS
• DATA MANAGEMENT
• EXECUTIVE/OPERATING SYSTEM
• SUPPORT SOFTWARE/UTILITIES
• DIAGNOSTICS

As  the  development  proceeds  the  nominal  predicted  reliability  for 
the  application  will  be  modified  based  on  the  development 
environment,  the  characteristics  exhibited  by  the  software  as  it 
evolves,  and  its  performance  during  testing.  This  is  analogous 
to  the  procedure  used  for  hardware  reliability  prediction  where 
initially  a  nominal  parts  failure  rate  is  assigned  which  is 
modified  by  quality,  derating,  and  environment  factors  as  the 
design  is  definitized. 

3.3.1.2 Development Environment (D)

This metric is concerned with effects of the development
environment on the reliability of the software produced within
that environment. In the development of the COCOMO software cost
model, Boehm found that there were significant differences
between three classes of environments which he termed organic,
semidetached, and embedded [BOEH81]. It is expected that these
environment characteristics will also affect software
reliability.

The following descriptions of each of the environments and the
table of distinguishing features (Table 3-6) are excerpted from
the cited reference.

    ORGANIC MODE - In the organic mode, relatively small
    software teams develop software in highly familiar,
    in-house environments. Most people connected with the
    project have extensive experience in working with
    related systems within the organization, and have a
    thorough understanding of how the system under
    development will contribute to the organization's
    objectives.

    SEMIDETACHED MODE - The semidetached mode of software
    development represents an intermediate stage between the
    organic and embedded modes. The team members all have
    an intermediate level of experience with related
    systems. The team has a wide mixture of experienced and
    inexperienced people, and team members have experience
    related to some aspects of the system under development,
    but not to others.

    EMBEDDED MODE - The major distinguishing factor of an
    embedded mode software project is a need to operate
    within tight constraints. The product must operate
    within (is embedded in) a strongly coupled complex of
    hardware, software, regulations, and operational
    procedures, such as an electronic funds transfer system
    or an air traffic control system. In general the costs
    of changing the other parts of this complex are so high
    that their characteristics are considered essentially
    unchangeable, and the software is expected both to
    conform to their specifications and to take up the slack
    of any unforeseen difficulties.

TABLE 3-6. DISTINGUISHING FEATURES (from [BOEH81])

FEATURE                            ORGANIC     SEMIDETACHED   EMBEDDED

ORGANIZATIONAL UNDERSTANDING       THOROUGH    CONSIDERABLE   GENERAL
OF PRODUCT OBJECTIVES

EXPERIENCE IN WORKING WITH         EXTENSIVE   CONSIDERABLE   MODERATE
RELATED SOFTWARE SYSTEMS

NEED FOR SOFTWARE CONFORMANCE      BASIC       CONSIDERABLE   FULL
WITH PRE-ESTABLISHED
REQUIREMENTS

NEED FOR SOFTWARE CONFORMANCE      BASIC       CONSIDERABLE   FULL
WITH EXTERNAL INTERFACE
SPECIFICATIONS

CONCURRENT DEVELOPMENT OF          SOME        MODERATE       EXTENSIVE
ASSOCIATED NEW HARDWARE AND
OPERATIONAL PROCEDURES

NEED TO INNOVATE DATA              MINIMAL     SOME           CONSIDERABLE
PROCESSING ARCHITECTURES,
ALGORITHMS

PREMIUM ON EARLY COMPLETION        LOW         MEDIUM         HIGH

PRODUCT SIZE RANGE                 <50 KDSI    <300 KDSI      ALL SIZES

EXAMPLES

  ORGANIC:
    BATCH DATA REDUCTION
    SCIENTIFIC MODELS
    BUSINESS MODELS
    FAMILIAR OS, COMPILER
    SIMPLE INVENTORY, PRODUCTION CONTROL

  SEMIDETACHED:
    MOST TRANSACTION PROCESSING SYSTEMS
    NEW OS, DBMS
    AMBITIOUS INVENTORY, PRODUCTION CONTROL
    SIMPLE COMMAND-CONTROL

  EMBEDDED:
    LARGE, COMPLEX TRANSACTION PROCESSING SYSTEMS
    AMBITIOUS, VERY LARGE OS
    AVIONICS
    AMBITIOUS COMMAND-CONTROL

A metric, Di, will be associated with each of these three
environments. That metric will be modified based on further
distinguishing characteristics shown in Table 3-7. These
characteristics further distinguish the level of formality,
discipline, and modern approach to the development effort
[SOIS85]. The characteristics will be in the form of a checklist
which will be used to score the development environment. The
score will modify the initial environment metric, Di, resulting
in the metric D. This resulting metric, D, will be a multiplier
of the fault density associated with the Application Type and
will affect it positively (the multiplier will be less than one
but greater than zero) or negatively (the multiplier will be
greater than one), thus representing the positive or negative
effect the development environment has on the production of
reliable software.

TABLE 3-7. DISTINGUISHING CHARACTERISTICS OF
DEVELOPMENT ENVIRONMENT (Modified from [SOIS85])

ORGANIZATIONAL/PERSONNEL CONSIDERATIONS

  Separate Design and Coding
  Independent Test Organization
  Independent Quality Assurance
  Independent Configuration Management
  Independent Verification and Validation
  Chief Programming Teams
  Above Average Educational Level of Team Members
  Above Average Experience Level of Team Members

METHODS USED

  Definition/Enforcement of Standards
  Use of HOL
  Formal Reviews (SRR, PDR, CDR, etc.)
  Frequent Walkthroughs
  Top Down and Structured Approaches
  Unit Development Folders
  Software Development Library
  Formal Change and Error Reporting
  Progress and Status Reporting

DOCUMENTATION

  System Requirements Specification
  Software Requirements Specification
  Interface Design Specification
  Software Design Specification
  Test Plans, Procedures and Reports
  Software Development Plan
  Software Quality Assurance Plan
  Software Configuration Management Plan
  Requirements Traceability Matrix
  Version Description Document
  Software Discrepancy Reports

DEVELOPMENT TOOLS

  Requirements Specification Language
  Program Design Language
  Program Design Graphical Technique (Flowchart, HIPO, etc.)
  Simulation/Emulation
  Configuration Management
  Code Auditor
  Data Flow Analyzer
  Quality Measurement Tools

3.3.1.3 Software Characteristics (S)

This set of metrics represents those characteristics of the
software which are likely to affect the software reliability.
The characteristics can be measured from the code and the
documentation produced during the software development process.
The metrics within this set are further organized, for
recognition purposes, under Requirements and Design
Representation metrics and Software Implementation metrics.
Those metrics in the former group are applied to the
documentation which represents the software requirements of the
system and the software design. They will typically be applied
at the time of formal reviews such as the Software Requirements
Review (SRR), the Preliminary Design Review (PDR), and the
Critical Design Review (CDR). Those metrics in the latter group
are applied to the code during the coding phase of the
development. Each metric is described in the following
paragraphs.

3.3.1.3.1 Requirements and Design Representation Metrics (S1)

• Anomaly Management (SA)

This metric represents the degree to which fault tolerance
has been designed and implemented in the system. The
ability of the software to accept anomalous input data,
recover from incorrect calculations, gracefully degrade,
and fail in a controlled manner contributes to its
reliability. Various strategies for developing error
tolerant software exist [MYER76]. A checklist approach
to evaluating these features was first proposed by
[MCCA77] and expanded by [BOWE83]. The features assessed
include:

- Error Condition Control
- Input Data Checking
- Computational Failure Identification and Recovery
- Hardware Fault Identification and Recovery
- Device Error Identification and Recovery
- Communication Failure Identification and Recovery

The metric, SA, is:

SA = ka/AM

where ka is a coefficient to be derived from regression
and AM is the evaluated score from application of the
checklists in [BOWE83] (metrics AM.1, AM.2, AM.3, AM.4,
AM.5, AM.6, AM.7, RE.1).

The checklists have been modified somewhat in the light of
experience gained during this effort. They are
presented in the Data Collection Procedures, Appendix B of
Volume II of this report.

• Traceability (ST)

The traceability metric is based on an identically named
criterion in [MCCA80] and [BOWE85]. The metric used
there, the cross reference relating modules to
requirements, will also be applied to the current study. The
basic concept of this criterion is that if the
requirements are traceable to the code then there is less of a
chance that a misinterpretation of the requirements can
result in a fault in the code.

The effect on reliability will be represented by the
traceability metric, ST, as:

ST = kt/TC

where kt represents a coefficient to be determined by
regression and TC is the traceability metric (TC.1) in
Table 3-1, which is calculated by identifying the total
number of requirements (NR) and dividing this number by
the total number of traceable requirements (NR-DR), where
DR is the number of requirements not traceable to design
or code. A methodology for itemizing requirements can be
found in [HERN83]; tools and techniques such as SREM
[BELL76] or PSL/PSA [TEIC70] also support this type of
calculation. A further description of how to calculate
the metric is in Volume II of this report.

• Quality Review Results (SQ)

During most large system developments various formal
reviews are conducted. Previously mentioned examples such
as SRR, PDR, and CDR are typical formal reviews. Informal
reviews, audits, or inspections may also be conducted.
Two such techniques are structured walkthroughs and design
and code inspections [FAGA70]. The quality of the
documentation and the design represented by the documentation
is reviewed during these activities. Any problems
identified are recorded as a problem report or action item
for correction. Studies have shown that the more problems
encountered early in a development the more likely it is
that problems will exist and be found later during test
and operation [LIPO79]. This metric, Quality Review
Results (SQ), represents a measure of the number of
problem reports or discrepancies reported during reviews.
The metric takes the following form:

SQ = kq · (NR/(NR-NDR))

where kq is a coefficient derived from regression (see
Section n), NDR is the number of discrepancy reports
identified, and NR is the total number of requirements
identified in the system.

Use of the worksheets (checklists) in Appendix D of Volume
II is advocated. These worksheets contain data elements
related to the software quality metrics in Table 3-1:

Accuracy (AC.1)
Completeness (CP.1)
Consistency (CS.1, CS.2)
Autonomy (AU.1, AU.2)

A discrepancy report should be generated for each question
on these worksheets answered negatively when applicable.
An example discrepancy report is shown in Figure 3-4.

    PROBLEM TITLE:                    PROBLEM NUMBER:
    PROGRAM CD:                       DATE:
    REFERENCES:                       ANALYST:

    PROBLEM TYPE:

      REQUIREMENTS
        Incorrect Spec
        Conflicting Spec
        Incomplete Spec

      DESIGN
        Requirements Compliance
        Choice of Algorithm
        Sequence of Operations
        Data Definitions
        Interface

      CODING
        Requirements or Design Compliance
        Computation Implementation
        Sequence of Operation
        Data Definition
        Data Handling
        Omitted Logic
        Interface
        Performance

      MAINTENANCE
        Incorrect Fix
        Incompatible Fix

      OTHER

    CRITICALITY:  HIGH   MEDIUM

    METHOD DETECTION:

    TEST EXECUTION:
      TEST CASE ID:
      TEST EXECUTION TIME:

    DESCRIPTION OF PROBLEM:

    EFFECTS OF PROBLEM:

    RECOMMENDED SOLUTION:

    APPROVED:                         DATE:
    RELEASED BY:                      DATE:

FIGURE 3-4. DISCREPANCY REPORT

The worksheets assess how well the following characteristics
have been addressed in the requirements and design
of the system.

- Accuracy - the concept of reliability includes precision,
  i.e., algorithms must be accurate within certain bounds.

- Completeness - the requirements and design should have
  the following characteristics:

  -- Unambiguous references,
  -- All data references defined, computed, or obtained
     from an external source,
  -- All defined functions used,
  -- All referenced functions defined,
  -- All conditions and processing defined for each
     decision point,
  -- All defined and referenced calling parameters
     agree, and
  -- All discrepancy reports resolved.

- Consistency - the requirements and design should have:

  -- Standard design representation,
  -- Calling sequence conventions,
  -- Input/output conventions,
  -- Data naming conventions, and
  -- Error handling conventions.

- Autonomy - the software components should be independent
  functions and as independent of their interfaces
  as possible.

In order for this metric to take on true significance,
statistical studies will have to be conducted of projects that
employ similar review concepts or at least devote similar levels
of effort to reviewing the requirements and design. Projects
employing IV&V contractors would be applicable subjects.
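The three Requirements and Design Representation metrics, and their product S1 = SA · ST · SQ from Table 3-2, can be sketched in Python as follows. The coefficients ka, kt, and kq are to be obtained by regression; the value 1.0 used for each below, and all counts and scores, are hypothetical placeholders. TC is computed exactly as the text states it, i.e., NR divided by (NR - DR).

```python
def anomaly_management(ka: float, am: float) -> float:
    """SA = ka / AM, where AM is the evaluated checklist score."""
    if am <= 0:
        raise ValueError("AM score must be positive")
    return ka / am


def traceability(kt: float, nr: int, dr: int) -> float:
    """ST = kt / TC, with TC = NR / (NR - DR) as stated in the text."""
    if not 0 <= dr < nr:
        raise ValueError("need 0 <= DR < NR")
    tc = nr / (nr - dr)
    return kt / tc


def quality_review(kq: float, nr: int, ndr: int) -> float:
    """SQ = kq * NR / (NR - NDR)."""
    if not 0 <= ndr < nr:
        raise ValueError("need 0 <= NDR < NR")
    return kq * nr / (nr - ndr)


# Hypothetical project: 200 itemized requirements, 20 of them not
# traceable, 40 discrepancy reports, and an anomaly management score of 0.8.
sa = anomaly_management(ka=1.0, am=0.8)    # 1.25
st = traceability(kt=1.0, nr=200, dr=20)   # 0.9
sq = quality_review(kq=1.0, nr=200, ndr=40)  # 1.25
s1 = sa * st * sq
print(sa, st, sq, s1)
```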

3.3.1.3.2 Software Implementation Metrics (S2)

•  Language  Type  (SL) 

The  programming  language  chosen  and  used  to  implement  a 
system  can  have  an  effect  on  the  reliability  of  the 
system.  A  significant  dependency  of  fault  density  on 
language  has  been  established  in  [HECH83] . 

The metric for Language (SL) will be based on the following
classification:

--  Assembly level programs, and

--  Higher-order language (HOL) programs.

The HOL category will represent the default (to be assigned a
value of 1). It has been assumed that one HOL statement will
generate machine instructions equivalent to two to eight
assembly statements. Five is a typical expansion ratio for
FORTRAN. Under these circumstances the metric is:

SL(Assembly) = 1.4

SL(HOL) = 1

Where programs contain a mixture of HOL and assembly language
code, the language criterion is computed as the sum of the
fractions applicable to each category. Thus, for a mixed
language program, the language metric, SL, is given by

SL = (HOL%) * 1 + (Assembly%) * 1.4
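
As a minimal sketch of the mixed-language calculation above, the following assumes the HOL and assembly shares are supplied as fractions that sum to 1. The function name and argument names are illustrative and do not come from the report.

```python
def language_metric(hol_fraction: float, assembly_fraction: float) -> float:
    """Return SL as the fraction-weighted sum of the per-language factors.

    The factors 1.0 (HOL) and 1.4 (assembly) are the values stated in the
    text; the argument names are hypothetical.
    """
    if abs(hol_fraction + assembly_fraction - 1.0) > 1e-9:
        raise ValueError("HOL and assembly fractions must sum to 1")
    sl_hol = 1.0
    sl_assembly = 1.4
    return hol_fraction * sl_hol + assembly_fraction * sl_assembly
```

For example, a program that is three-quarters HOL and one-quarter assembly would be called as `language_metric(0.75, 0.25)`.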
•  Program Size (SS)

This metric represents the effect of total size on reliability.
We already stated that the failure rate measure of reliability
is self-normalizing with respect to size; however, we feel there
are secondary effects which should be taken into account. These
secondary effects are associated with inherent complexity,
number of interactions, data base size and the ability of humans
to deal with extremely large systems.

The metric will be a multiplier associated with size categories
(or ranges). Tentatively, the size categorizations to be used
are:

SS(1):  fewer than 10000 lines of code
SS(2):  10000 to 50000 lines of code
SS(3):  50000 to 100000 lines of code
SS(4):  more than 100000 lines of code

In this case, lines of code are defined as all executable
source statements.

•  Modularity (SM)

It is generally held that small modules can be more readily
reviewed and are, therefore, less likely to contain faults than
larger modules (this is implicit in MIL-STD-1679). It is
intended to establish three categories for module size, based
on the number of executable statements:

SM(1):  fewer than 200 lines of code
SM(2):  200 to 3000 lines of code
SM(3):  more than 3000 lines of code


For the assessment of software development practices it might
be of interest to apply this metric to individual modules and
to correlate it with failures due to these modules. In many
cases, available data from historical projects do not support
an analysis at this detailed level. Regardless of data quality,
it is frequently impossible to associate a specific module with
a software failure (e.g., for failures due to missing
requirements, faulty interface specifications or
implementations). For cases where detailed data is available,
the metric will be evaluated by the following:

SM = (u*SM(1) + v*SM(2) + w*SM(3)) / (u+v+w)

where SM is the overall module size metric, lower case letters
are the number of modules in a given category and upper case
letters are the module size coefficients applicable to each
category.
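
The weighted average above can be sketched as follows. This is an illustration only: the function name is hypothetical, and the coefficient values passed in are placeholders, since the report derives the actual coefficients by regression.

```python
def module_size_metric(counts, coefficients):
    """Return SM as the count-weighted average of per-category coefficients.

    counts       -- (u, v, w): number of modules in each size category
    coefficients -- (SM(1), SM(2), SM(3)): placeholder values supplied by
                    the caller; the report obtains them by regression
    """
    if len(counts) != len(coefficients):
        raise ValueError("one coefficient is required per size category")
    total_modules = sum(counts)
    if total_modules == 0:
        raise ValueError("at least one module is required")
    weighted = sum(n * k for n, k in zip(counts, coefficients))
    return weighted / total_modules
```

With 10 small, 5 medium, and 5 large modules and placeholder coefficients of 1.0, 2.0, and 4.0, the call would be `module_size_metric((10, 5, 5), (1.0, 2.0, 4.0))`.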

For the purpose of reliability prediction, for this study, it
is considered adequate to base the metric for module size on
the average size in a program (i.e., total executable
statements divided by the number of modules). The metric, SM,
applicable to each module size classification was evaluated by
regression (see Section 5).

•  Extent of Reuse (SU)

As the application of computers to Air Force projects matures,
there are increasing opportunities for including portions of
operational code in new software developments. The practice
appears desirable for reliability as well as for economic
reasons. Code from current operational programs is expected to
contain fewer faults than newly generated code, since through
previous test and maintenance efforts its reliability will have
grown to an acceptable level. The reliability of the current
code is assumed to be known by observation during operation.

However, it is important to recognize that any differences in
environment, application, or interfaces that the existing
software may encounter will have a potential impact on its
reliability. In the situation where new code is being added to
existing code in the same environment, the existing code's
reliability can be taken as observed. In the situation where
the existing code is being used in a new environment as part of
the development of a new application, it cannot be expected,
without analysis, to perform with its established reliability
because of new requirements and interfaces. In each case,
though, the failure rate for the reused code should be less
than that for the new code. The metric for reused code (SU) in
reliability prediction will be:

SU = SU(i)

where SU(i) is a factor derived from empirical data.

Initially we expect this factor to be determined by looking it
up in a table with data from a limited number of projects.

•  Complexity (SX)

Candidate metrics include the SI.3 and SI.4 metrics from
[BOWE83] (see Table 3-1). SI.3 is McCabe's cyclomatic
complexity metric [MCCA76] and SI.4 is the checklist assessing
the simplicity with which a program is implemented. Halstead's
metrics (SI.6) should also be considered [HALS77]. Past
experience applying these metrics indicates McCabe's metric to
be more applicable because it can be automatically calculated
and has demonstrated better correlation than Halstead's metric
[MCCA60].

Since this metric is applied when the project is close to
entering the reliability estimation phase, prediction that
accounts for complexity may be helpful in several ways:

--  It will identify the role that complexity plays in causing
failures (by use of regression techniques).

--  It will encourage recording of complexity measures as part
of the project history.

--  By virtue of the above it will identify long range trends
of increasing or decreasing complexity which may not otherwise
be captured in an analysis of software failures.

This metric is applicable at the module level. Again, the
availability of data at this level may hinder the establishment
of a prediction coefficient and use of the metric during
projects. When available the metric (SX) will be:

SX = kx * ( SUM[i=1..n] SXi ) / n

where SXi is McCabe's complexity (SI.3 in Table 3-1) for each
module, i, in the system, n equals the total number of modules
in the system, and kx is a coefficient derived from regression.
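
A minimal sketch of this calculation, assuming the per-module McCabe values have already been measured by some tool. The function name is hypothetical, and the value of kx must be supplied by the caller because the report derives it by regression.

```python
def complexity_metric(module_complexities, kx):
    """Return SX = kx * (mean McCabe complexity over all modules).

    module_complexities -- one cyclomatic complexity value per module
    kx                  -- regression coefficient (caller-supplied placeholder)
    """
    n = len(module_complexities)
    if n == 0:
        raise ValueError("at least one module is required")
    return kx * sum(module_complexities) / n
```

For three modules with complexities 4, 10, and 7 and a placeholder kx of 0.5, the call would be `complexity_metric([4, 10, 7], 0.5)`.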
•  Standards Review Results (SR)

As during requirements and design, reviews, audits, inspections
and walkthroughs are techniques for identifying discrepancies
or problems to be corrected. This metric represents the number
of problems identified per module based on reviews or audits of
the code.

Worksheets from software quality metrics (SI.1, SI.2, SI.4,
SI.5, MO.1, MO.2) are advocated. Enforcement of programming
standards is another technique by which discrepancies would be
identified. Worksheets are in Appendix D of Volume II. The
overall metric then will be a composite, based on the
evaluation of the following characteristics:

-  Design organized in top-down fashion,

-  Independence of module,

-  Module processing not dependent on prior processing,

-  Each module description includes input, output, processing,
limitations,

-  Each module has a single entrance, single exit,

-  Size of data base,

-  Compartmentalization of data base,

-  No duplicate functions, and

-  No global data.

The metric will be:

SR = kv * (n/(n-PR))

where n  = number of modules
      PR = number of problem modules identified with severe
           discrepancies
      kv = coefficient derived by regression
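
The following sketch assumes the formula is read as kv multiplied by n/(n-PR), so that the multiplier grows as more modules show severe discrepancies. That reading, the function name, and the handling of the case PR = n are assumptions rather than statements from the report.

```python
def standards_review_metric(n_modules, n_problem_modules, kv):
    """Return SR = kv * n / (n - PR) under the assumed reading of the formula.

    n_modules         -- n, total number of modules reviewed
    n_problem_modules -- PR, modules with severe discrepancies
    kv                -- regression coefficient (caller-supplied placeholder)
    """
    if not 0 <= n_problem_modules <= n_modules:
        raise ValueError("PR must lie between 0 and n")
    if n_problem_modules == n_modules:
        # The ratio is undefined when every module is a problem module;
        # raising here is an assumption, not behaviour taken from the report.
        raise ValueError("SR is undefined when PR equals n")
    return kv * n_modules / (n_modules - n_problem_modules)
```

With 20 modules, 4 of them problem modules, and a placeholder kv of 1.0, the call would be `standards_review_metric(20, 4, 1.0)`.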
Classification of the types of problems being identified can be
helpful. Three problem classification schemes are shown in Table
3-8. The middle column has been used most widely in the past.
The right hand column is the one advocated, primarily because of
its development phase orientation. By looking at the types of
errors being identified, standards can be improved, checklists
can be improved, and development techniques can be improved to
help avoid making similar errors in the future.
3.3.1.4 Other Metrics

Two other quality metrics identified in Table 3-1,
Self-Descriptiveness and Distributedness, were not used.
Self-Descriptiveness seemed particularly applicable to
maintainability and not appropriate for reliability prediction.
Distributedness is appropriate for distributed systems and,
therefore, a special case not applicable to our generic
methodology.

Visibility, a quality metric identified in Table 3-1, is
appropriate as an estimation metric and is discussed in
subsequent paragraphs.

3.3.2  Estimation  Metrics 

As previously discussed, the use of reliability model technology
has not been widely accepted. The basic approach of this
technology, observing the failure rate of the software during
test, will be used within our methodology. Our approach to
estimation is to observe testing and calculate the observed
failure rate of the software. This basic estimation number will
be adjusted based on one of two environmental metrics, T during
the development test phases and E during the Operational Test and
Evaluation phase. The estimation number will be the product of
the observed failure rate and one of those metrics. These
metrics are described in the following paragraphs.

3.3.2.1 Failure Rate During Test (F)

The basic metric for estimation will be the observed failure rate
during testing (F). Reliability models have been researched for
a number of years and provide a mechanism for estimation. The
basic philosophy of the reliability models is illustrated in
Figure 3-5 (using the Musa Model as an example) [MUSA73]. The
observed number of failures over time (and therefore the mean
time between failures) is extrapolated via a curve fitting
exercise (using the basic assumed model); knowing the amount
of test time expended to date, one can then estimate the amount
of additional test time required to achieve an acceptable
(estimated) failure rate. A large number of models exist.
Twenty-three models described in [GOEL83] are listed in Table
3-9. Experience using these models has varied ([MUSA79],
[RICH83], [ANGU83]), and that variability makes the models
suspect as estimation techniques. In lieu of their use, tracking
the observed failure rate during testing provides a basis for
estimation. This is illustrated in Figures 3-6 and 3-7. Figure
3-6 demonstrates the use of execution-time measures during the
pre-operational (test) phase [HECH77]. The data came from the
development of the Metric Integrated Processing System (MIPS) at
Vandenberg Air Force Base, during which disciplined programming
techniques were introduced under an RADC sponsored effort. The
linear regression line exhibits an improvement in reliability
(reliability growth) over time (the downward slope). It also
shows several significant increases in failure rate during
specific months. In each case there was a specific reason: in
May and August 1976 major new modules were added to the system
under test; in October 1976, the contractor's quality assurance
organization took over responsibility for the test; and January
1977 marked the start of testing by the Air Force.

Similar consistency in time for this type of metric during
operation is shown in Figure 3-7 [MUSA79]. The failure rate is
indicated by the slope of the data line. Note that the ordinate
scale is nonlinear in order to permit the number of failures
predicted by the Musa model to be plotted as a straight line. A
last example is provided in Figure 3-8 from [ANGU79]. In this
example, a consistent reliability growth was not observed. A high
failure rate was still being observed at the end of the
illustrated test phase.

By tracking this metric during testing, the trend in the observed
failure rate can be monitored and used as the basis for
estimating what the expected operational reliability will be.
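
The tracking described above can be sketched as a per-period failure rate with a least-squares trend line, in the spirit of the regression line discussed for the MIPS data. The function names and any input data are hypothetical; nothing here reproduces the MIPS measurements.

```python
def failure_rates(failures_per_period, exec_hours_per_period):
    """Return the observed failure rate F (failures per execution hour)
    for each test period."""
    if len(failures_per_period) != len(exec_hours_per_period):
        raise ValueError("one execution-time value is required per period")
    return [f / h for f, h in zip(failures_per_period, exec_hours_per_period)]


def trend_slope(rates):
    """Return the least-squares slope of the rates against the period index
    (0, 1, 2, ...). A negative slope suggests reliability growth."""
    n = len(rates)
    if n < 2:
        raise ValueError("at least two periods are required")
    mean_x = (n - 1) / 2.0
    mean_y = sum(rates) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx
```

A period in which the rate rises against an otherwise negative slope would then prompt a look for a specific cause, such as newly added modules or a change in test organization.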

3.3.2.2 Test Environment (T)

Several  characteristics  of  the  test  environment  should  be 
accounted  for  in  the  estimation  of  reliability.  The  observed 
failure  rate  may  not  accurately  represent  what  the  operational 
reliability  will  be  because: 

•  The test environment does not accurately represent the
operational environment,

•  The test data does not thoroughly exercise the system,
thereby leaving untested many segments of the code,

•  The testing techniques employed do not thoroughly test the
system, and

•  The amount of testing time does not thoroughly test the
system.

These characteristics are taken into account by the metrics to be
discussed in this paragraph. In each case the metrics will be in
the form of a multiplier, the product of all of these to be used
to adjust the observed failure rate (F) up or down depending on
the level of confidence in the representativeness and
thoroughness of the test environment (T = TE*TM*TC).

•  Test Effort (TE)

This metric is intended to represent the amount of effort
applied to testing. Three alternatives are to be evaluated.
The first alternative is the test budget (dollars or labor
hours), which would appear to be a good metric for the amount
of test. Comparison with a guideline of 40% of total
development effort would be the metric. However, there are
considerable difficulties in obtaining credible figures on
this, particularly where parts of the test were conducted by
the developer and other parts by the Government or a separate
contractor. Also, because test is the project activity most
likely to be under budget and schedule pressure, substantial
parts of test are sometimes conducted as a supplemental project
for which data are not recorded in the main project records.

A second alternative is the total calendar time devoted to
test, for use as a comparison among projects of approximately
equal size. Normalization by dividing by total lines of code
may be inappropriate because of non-linearities affecting large
projects. However, normalized calendar time will be evaluated
as a metric for the amount of test during this study.

As a third alternative, the number of separate test teams
involved will be evaluated. In a major project, the following
may be responsible for major phases of software test:

-  Software Developer,

-  Developer's Software Test or QA Staff,

-  System Integrator,

-  Independent Validation Contractor,

-  Air Force Test Agent (Air Force Operational Test and
Evaluation Command),

-  Sponsor (Air Force Systems Command), and

-  End User (Air Force Operational Command).

The more teams involved, the more thoroughly the system will be
tested. The metric, TE, will be examined in these three forms
during the validation phase of the project and the form which
exhibits the best results will be chosen. The three forms are:

(1)  TE = 40/AT

where AT = the percent of the development effort devoted to
testing.

(2)  TE = 40/AT

where AT = the percent of the development schedule devoted to
testing.

(3)  TE = SUM[i=1..n] TT(i)

where TT(i) is a factor (to be determined by regression)
associated with each test team mentioned above and n is the
number of test teams applied.
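
The three candidate forms of TE can be sketched as follows. The function names are hypothetical, and the team factors TT(i) are placeholders because the report leaves them to be determined by regression.

```python
GUIDELINE_PERCENT = 40.0  # the 40% guideline stated in the text


def te_from_percent(percent_devoted_to_test):
    """Forms (1) and (2): TE = 40/AT, where AT is the percent of development
    effort (form 1) or of development schedule (form 2) devoted to testing."""
    if percent_devoted_to_test <= 0:
        raise ValueError("AT must be a positive percentage")
    return GUIDELINE_PERCENT / percent_devoted_to_test


def te_from_teams(team_factors):
    """Form (3): TE = sum of the TT(i) factors for the test teams applied.
    The factor values are caller-supplied placeholders."""
    return sum(team_factors)
```

Under forms (1) and (2), a project that devoted 20% of its effort to test would yield `te_from_percent(20.0)`, a multiplier above 1 that raises the estimated failure rate.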

•  Test Methodology (TM)

The  test  methodology  used  is  another  element  by  which  to 
assess  the  thoroughness  of  testing.  One  measure,  TM,  that 
suggests  itself  is  the  use  of  test  tools  and  testing 
techniques.  In  most  cases  the  tools  are  being  operated  by 
a  staff  of  specialists  who  are  also  aware  of  other 
advances  in  software  test  technology.  The  primary 
emphasis  will  be  on  classifying  the  test  environment  by 
the  tools  and  techniques  used.  Distinctions  based  on  the 
type  of  test  tools  and  techniques  used  will  be  made. 

A technique and handbook for doing this assessment (or
classification) has been developed. In the Software Test
Handbook [PRES84], a technique to determine what tools and
techniques should be applied to a specific application is
provided. That technique is illustrated in Figure 3-9 and
results in a recommended set of testing techniques and tools.
Our approach will be to use that recommendation to evaluate the
techniques and tools applied on a particular development. This
evaluation will result in a score that will be the basis for
this metric as follows:

TM = kt * TR/TU

where TU is the number of tools and techniques used, TR is the
number recommended, and kt is a constant determined by
regression.

The tool and technique checklist in [PRES84] is specifically to
be used to assess testing. The tool and technique checklist
shown earlier (Table 3-7) was for the development phases of
requirements, design, and coding.

•  Test  Coverage  (TC) 


This metric assesses how thoroughly the software has been
exercised during testing. If all of the code has been exercised
then there is some level of confidence established that the
code will operate reliably during operation. Typically,
however, test programs do not maintain this type of information
and a significant portion (up to 40%) of the software
(especially error handling code) is never tested. Tools such as
JAVS, FAVS, and CAVS (developed under RADC contracts) provide
such information.

This metric could be calculated in three ways depending on the
phase of testing as follows:

TC = ktc * 1/VS

where ktc is a constant determined by regression,

VS = VS1 during unit testing
   = VS2 during integration testing
   = VS3 during system testing

and

VS1 = (PT/TP + IT/TI)/2

where PT = execution branches tested
      TP = total execution branches
      IT = inputs tested
      TI = total number of inputs

VS2 = (MT/TM + CT/TC)/2

where MT = units tested
      TM = total number of units
      CT = interfaces tested
      TC = total number of interfaces

VS3 = RT/NR

where RT = requirements tested
      NR = total number of requirements

3.3.2.3 Operating Environment (E)

Several characteristics of the operational environment,
experienced during OT&E, should be accounted for in estimating
reliability. Again, during OT&E we are trying to extrapolate the
observed failure rate (F) into operations. The characteristics
we want to account for are the workload and the variability of
inputs. These two characteristics, for which we have developed
metrics, represent the stress of the operational environment on
the software. The metrics will be multipliers which will raise
or lower the estimated failure rate depending on the degree of
stress (E = EW * EV).

•  Workload (EW)

The relationship between the workload and software failure
rate has been investigated at Stanford University and a
very significant positive correlation has been reported
[ROSS82]. The basic concept underlying this phenomenon is
that more unusual situations (program swapped in and out
of memory, queued I/O, wait states, etc.) are encountered
in a heavy workload, and the application programmer may
not have anticipated all the situations. In addition,
system software will tend to fail more often when used
more often.


The measured workload will be transformed into a stress
metric as follows:

EW = kew * ET/(ET-OS)

where OS is the amount of Operating System overhead used,
ET is the total execution time, and kew is a constant
determined by regression. This form of relationship (linear)
will be developed if applicable. If not, a more general
relationship, EW = f(OS), will be developed.

The use of operating system overhead was chosen because it
is usually available. Other alternatives are number of
system calls per minute, number of paging requests, and
number of I/O operations.
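
A minimal sketch of the workload stress metric, assuming ET and OS are measured in the same time units. The function name is hypothetical and kew is a caller-supplied placeholder for the regression constant.

```python
def workload_metric(total_exec_time, os_overhead, kew):
    """Return EW = kew * ET/(ET - OS).

    total_exec_time -- ET, total execution time
    os_overhead     -- OS, operating system overhead, in the same units as ET
    kew             -- regression constant (caller-supplied placeholder)
    """
    if not 0 <= os_overhead < total_exec_time:
        raise ValueError("OS overhead must be non-negative and less than ET")
    return kew * total_exec_time / (total_exec_time - os_overhead)
```

With 100 hours of execution, 20 of them operating system overhead, and a placeholder kew of 1.0, the call would be `workload_metric(100.0, 20.0, 1.0)`.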

•  Variability of Input (EV)

Variability of the input is the primary determinant of
software reliability in some models, such as the ones
proposed by Nelson and Lipow [DACS79] and Roger Cheung
[CHEU81]. The basic concept here is that the greater the
variability of inputs to the program, the more likely an
unanticipated input will be encountered and the program
will fail. Neither one of these models is supported by
sufficient data to permit direct evaluation of the effect
of variability on failure frequency, however. Nelson and
Lipow proposed partitioning of the input data set, and an
index of variability can then be derived from the number
of partitions accessed during one time period or one run.
This appears practical in only a very limited number of
applications. Cheung uses the calling sequence as an
indicator of variability, a somewhat more easily
implemented measure, but still targeted primarily to a
research environment. It is proposed to use the frequency of
exception conditions as a practical measure of variability
in the current effort. The monitoring of exception
conditions is accomplished by hardware provisions which
are incorporated in many current computers. Significant
correlation between the frequency of exception conditions
and failure rate has been demonstrated [IYER83].

The metric will be:

EV = .1 + 4.5 EC

where EC is the number of exception conditions encountered
per hour.

The constant value of .1 and the coefficient of 4.5 were
derived as a result of the analysis in [IYER83].
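
The input-variability metric and its use in adjusting the observed failure rate can be sketched as follows. The constant .1 and coefficient 4.5 are those stated in the text; the function names are hypothetical, and the second function simply applies the product E = EW * EV described at the start of this paragraph.

```python
def input_variability_metric(exceptions_per_hour):
    """Return EV = 0.1 + 4.5 * EC, where EC is the number of exception
    conditions encountered per hour."""
    if exceptions_per_hour < 0:
        raise ValueError("EC cannot be negative")
    return 0.1 + 4.5 * exceptions_per_hour


def estimated_failure_rate(observed_rate, ew, ev):
    """Return the observed failure rate F adjusted by E = EW * EV."""
    return observed_rate * ew * ev
```

An exception frequency of 0.2 per hour gives an EV close to 1, which leaves the observed failure rate essentially unchanged by this factor.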
3.4 TIMING OF METRIC APPLICATION DURING THE LIFE CYCLE

Figure 3-10 indicates when during the development phase each of
the metrics identified would be applied. This application
requires data collection, described in the next section, and then
use in the prediction or estimation procedures described in
Volume II.
4.0 DATA COLLECTION IN SUPPORT OF THE SOFTWARE RELIABILITY
PREDICTION AND ESTIMATION METHODOLOGY


4.1  DATA  COLLECTION  APPROACH 

One  of  the  more  significant  undertakings  of  this  project  was  the 
data  collection  activities  associated  with  demonstrating  and 
validating  the  methodology.  The  goals  during  this  phase  of  the 
project  were: 

•  Filter the candidate measurements, i.e., eliminate
measurements that had no potential for utility in the
methodology and identify those that appear to have predictive
or estimation potential.

•  Establish  a  data  base  from  which  a  draft  handbook  (Volume 
II)  could  be  developed. 

•  Collect  a  set  of  data  with  which  preliminary  validation 
efforts  could  be  performed.  These  validation  efforts  are 
preliminary  because  as  a  result  of  them  some  changes  to 
the  measurements  have  been  made  (thus  requiring  further 
iteration)  and  because  a  more  exhaustive  set  of  data  would 
be  required  to  perform  more  extensive  validation. 

•  Establish  data  collection  procedures  for  the  Reliability 
Prediction  and  Estimation  Methodology. 

The  overall  approach  to  the  data  collection  is  illustrated  in 
Figure  4-1. 

During Phase I, a number of projects were identified as potential
sources of data for this project. Also during Phase I, a
literature search was conducted. This literature search had
three purposes. One was to identify reliability measures that
had been established and tried within the industry. A second was
to further extend the references available to software
reliability practitioners and document terminology (see Appendix
A). The third was to collect any documented experiences as part
of the data base to be used in this project. The RADC Data and
Analysis Center for Software (DACS) and the NASA Software
Engineering Laboratory (SEL) data bases were also utilized.

Each software project, data base, and reference was analyzed for
applicability to this effort. The analysis mainly consisted of
identifying whether enough documentation, source code, and
failure history existed and was available for use. If this data
existed and was available, further investigations were conducted
to determine where in the life cycle the data was from, how
reliable the data was, and how current the data was. Some
projects and sources were eliminated from consideration because
of these factors. The resulting set comprised the candidate set
of projects and data sources. As many as possible were included
in the data collection and validation activities. A few were
not, because the level of effort of this project prohibited
their inclusion. Those projects have been retained for future
analysis. The next paragraph, 4.2, identifies all of the
candidate projects and data sources.

The next step in the data collection approach was to sort the
projects and data sources as to their applicability to the
candidate prediction and estimation measurements identified in
the preceding section. This sort was necessary for two reasons.
The first is that the measurements themselves represent different
levels of data spanning system level characterizations down to
module level measures, cover different time periods in a system
life cycle, and require different levels of problem reporting
association. Thus the measurements require different levels of
detail, and this step provided for the process of aligning
projects and data sources with metrics. A second reason this
step was necessary was that not all the projects and data
sources were compatible in terms of data availability. Some only
provided data at a system level. Some only provided detailed
data for certain measurements and not all. This non-homogeneity
is a fact of life; all data collection efforts are faced with
it. Our approach to dealing with this fact was to gather enough
data from enough sources to be able to fully cover all of the
measurements. There is further discussion of this point in
paragraph 4.2.

Data collection procedures were established and the data
collection activities proceeded. Periodic data collection team
meetings were held, not only to check progress but also to
discuss problems being encountered so that corrective actions
could be taken. As a result of these meetings a number of
lessons learned have been recorded and are discussed in
paragraph 4.4. As part of the data collection activities, any
tools that would aid in the data collection were identified and
used. The tools used are described in paragraph 4.3.

Figure 4-2 is a more detailed illustration of the data collection
activities. Two RADC Technical Reports (RADC TR 85-37 and RADC
TR 84-53) were key to the data collection activities. RADC TR
85-37 provided a set of worksheets associated with many of the
Software Characteristics Metrics (Anomaly Management,
Traceability, Quality Review Results, Size, Modularity,
Complexity, and Standards Review Results). RADC TR 84-53
provided a process for evaluating the Testing Methodology. The
data collection activities essentially paralleled an actual
application of the Reliability Prediction and Estimation
Methodology (see Volume II). A set of data collection tasks was
oriented toward collecting the data associated with the
prediction metrics. This set was generally applied to the
documentation and source code. Another set was oriented toward
collecting the data associated with the estimation metrics.
This set was generally applied to the test (in some cases
operational) results. As part of this second set, failure data
was collected which later was used to demonstrate and validate
the use of the measurements as predictors and estimators of
software reliability.

In both cases, an initial set of data collection procedures was
produced to aid in the data collection activities and was
revised based on the experience gained. The data collection
procedures are included as Appendix B to Volume II.

The primary end result of the data collection activities, besides
the data collection procedures, was a data base that could be
used to demonstrate and validate the measurements identified in
Section 3 of this report.

4.2  DATA  SOURCES 

The  sources  of  data  for  this  project  fall  into  three  categories: 
existing  data  bases  such  as  the  DACS  and  SEL  data  bases;  results 
and  data  reported  in  the  literature;  and  data  collected  from 
projects  during  this  contract  effort. 

In the following paragraphs, each source of data used during
this effort is briefly described and a reference, if
appropriate, is cited. The type of data available from each of
these projects is also described. In situations where the
project cited was used as a source for detailed data, the various
documents and data available are identified. A summary of
these data sources is in Table 4-1.

Radar Control System (1)

This project's error history was documented in [WILL77] and
compared with other projects in [FISH79]. It is a real-time
control system for a land-based radar complex. It was written in
JOVIAL and assembly language. The data available was primarily
used to distinguish fault densities by application type. The
failure data represented integration and operational test
results.

Avionics Control System (2)

This project's error history was documented in [FRIE77] and
compared with other systems in [FISH79]. It is an avionics
control system that was developed in JOVIAL and assembly
language. The data available was primarily used to distinguish
fault densities by application type. The failure data
represented module verification, intermodule compatibility.

Satellite Command and Control System (3)

This project's error history was documented in [THAY76] and
compared with other systems in [FISH79]. It is a large command
and control system written in JOVIAL and assembly language. The
data available was primarily used to distinguish fault densities
by application type. The failure data represented development
testing, validation testing, acceptance testing, integration
testing and operational testing results.

ABM  Command  and  Control  System  (4) 

This project's error history was documented in [BAKE77] and
[MOTL76]. It was compared with other systems in [FISH79]. It is
a  ground-based  command  and  control  system  for  an  anti-ballistic 
missile  system.  It  was  written  in  the  CENTRAN  programming 
language  and  the  failure  data  collected  represented  unit  testing, 
functional  testing  and  system  integration  testing  results.  The 
data  available  was  used  primarily  to  distinguish  fault  densities 
among  applications. 

C3I System (5)

This  project  is  a  classified  Command,  Control,  Communications  and 
Intelligence  system.  Due  to  the  classification,  the  system  is 
not identified nor is documentation available. The failure
history, collected during an operational window of four months,
was provided in an unclassified form for use in this effort. The
data  available  consists  of  failure  rate  data. 

Interactive  System  (6) 

This  project  is  an  interactive  system  developed  by  a  Government 
fiscal agency for use internally. The data represents operational
failures during a six-month period in 1981. The data
available  was  used  primarily  to  distinguish  failure  rates  by 
application  type. 

Scientific System (7)

This  project  is  the  Launch  Support  Data  Base  (LSDB)  program  at 
Vandenberg  AFB.  The  failure  data  was  derived  and  reported  in 
[HECH77].  The  data  represents  failure  rate  data  collected  during 
development  and  integration  testing  prior  to  acceptance.  It  was 
used  primarily  to  distinguish  failure  rates  by  application  type. 

Flight  Control  System  (8) 

This  project  is  the  digital  flight  control  system  of  the  Advanced 
Fighter  Technology  Integration  (AFTI)  F-16  program.  The  failure 
rate observed during flight testing over a 13-month period was
reported in [MACK83a,b].

Command and Control Operating System (9)

This project is a classified ground-based command and control
system. Software problem reports filed over a 25-month period
were collected. Testing averaged 200 hours per month.


Training System (10)

This project is a large complex training system built to support
the U.S. Army. The system comprises a real-time message
handling subsystem, interactive graphics workstations, and
post-operations playback. The system provides real-time display
of instrumented exercises to observers. This project was used as
a source of most of the detailed data required. A complete set
of development documentation as well as source code, test results
and operational performance data was available or collected for
analysis.


Mission Planning System (11)

The mission planning system for the Air Launch Cruise Missile was
a  source  of  Independent  Verification  and  Validation  problem 
reports.  Development  problem  report  statistics  were  available 
for  an  initial  version  of  the  system.  This  system  contains 
planning  software  and  report  generation  software. 


Flight Control (12)

This data set contains data from four flight control and related
program applications. The data is reported in [PRES81] and
[ROCK81] and analyzed in [HECH83]. The data reported is fault
density and was used primarily for establishing the application
type.


Interactive Systems (13)

This data set represents four interactive systems, one a
commercial system and three military systems. These data sets are
reported in [MUSA79] (as systems 5, 17, 27, and 40). Each system
is a large interactive system and the fault density data provided
is from system test.


Electronic Switching System (14)

The source for this data set is [DAVI81]. It is an electronic
switching system developed by Bell Laboratories. The data
presented is from installation and operations of the system and
represents very high reliability.


Scientific System (15)

This data set is from the Viking project at the Jet Propulsion
Laboratory [MAXW78]. Failure rate data is provided from a
four-month period during operations.


Command and Control System (16)

This data set is from a classified command and control system.
The data available are fault density and source code
characteristics. The failure data is from development and
integration testing.

Process Monitoring System (17)

This  data  set  is  from  an  Emergency  Response  Information  System 
developed  to  monitor  a  Nuclear  Power  Plant.  Data  available 
includes  fault  density,  development  documentation,  source  code, 
and  code  characteristics.  The  failure  data  available  represents 
problems  recorded  during  acceptance  testing  and  operational  use. 

Support System (18)

This data set is a data reduction system developed for in-house
use on the F-111D project [WAGO73]. Failure rate data is
available. The data was used primarily to determine Application
Type baselines.

Command and Control Systems (19)

This data set comprises four real-time display management
and command execution systems, all command and control
applications. The data, consisting of fault density and failure
rate data, is recorded in [MUSA79] as systems 1, 2, 3 and 4. This
data was used primarily to establish Application Type baselines.

Interactive Operating System (20)

This data set represents failure rates for two computer
installations at Stanford University [IYER81]. The data spans
three years of operational use. This data was used primarily to
establish Application Type baselines.

Image  Processing  System  (21) 

This  data  set  was  reported  in  [GRAS82]  for  an  Image  Processing 
System development. During the development, a commitment to
collect  software  quality  metrics  was  made.  The  results  of  this 
application  are  reported  in  the  above  reference.  Failure  data 
was  collected  during  two  incremental  builds  of  a  system  and 
during  acceptance  testing. 

Flight  Control  (22) 

This  data  set  is  for  the  ALCM  Operational  Flight  System  reported 
in  [HECH83] .  Fault  density  data  is  available  and  was  used 
primarily  to  establish  an  Application  Type  baseline. 

Flight Control (23)

This data is also in [HECH83] and represents several projects or
generations of the same system. Fault density was available.

Support Programs (24)

This data represents support software and a simulator supporting
flight control software development and testing. It is
summarized in [HECH83]. This study used the summarization of the
fault density experience data to help establish an Application
Type baseline.

Satellite C2 (25)

This data is a subset of data available from the SEL data base.
It is reported in [HECH83], [BASI77], [CARD82] and [TURN81]. The
fault  densities  recorded  for  11  different  projects  or  software 
systems  were  used  to  help  establish  an  Application  Type  baseline. 
All  of  the  systems  were  related  to  the  Satellite  C2/Telemetry 
processing  systems  developed  and  operated  at  NASA/Goddard. 

MIS (26)

This data was reported in [HIER86]. It is from four projects
involving small business systems. An analysis of the impact and
benefit that software quality metrics can have was reported in
the reference. The development environment and test effort were
available, as well as fault density, for these four projects,
which ranged in size between 10,000 and 30,000 lines of code.

ARGOS (27)

This data was reported in [TROY86] as a study of software failure
reporting within a large data processing center. The center
acquires, processes and distributes telemetry data. Failure rate
information is provided.

Interactive System (28)

This project involved a dual CPU processing system able to handle
500 on-line users [MIYA-]. Software as well as hardware
reliability goals were set for the project, and progress toward
the achievement of these goals was monitored. An evaluation of
reliability models [GOEL83] was made. Failure rate data was
provided.

Signal Processing (29)

Failure rate and failure density data is provided in [MEND79] for
two signal processing applications. Additionally, an evaluation
of error types and of the validity of reliability models is
presented.

MIS (30)

Fault density data is provided based on an evaluation of an Army
Logistics Support MIS system [LEHM82]. Over 1.6 million lines of
code are represented in the study.


Simulation (31)

Fault density and error categorization data is presented in
[WEIS78] for a computer architecture simulation facility.


Command and Control System (32)

This data source is project 2 reported in [THAY76]. Fault
density and software and error characteristics are provided for
this command and control system written in JOVIAL.


Simulator (33)

This data source is project 5 reported in [THAY76]. It is a
simulator developed in FORTRAN and Assembly language. Fault
density and software and error characteristics are provided.

Thirty-three (33) data sources are identified, representing 59
different projects. Most of these data sets were used during
this project to establish baseline reliability numbers for
different types of applications. Several were used to evaluate
the candidate predictive and estimation measures identified in
the preceding section. Data Sources 10 and 17 specifically were
projects from which detailed data were collected for the purpose
of demonstration and validation. The DACS and SEL data bases
were utilized to the extent possible. Data Sources 1, 2, 3, 4,
13, 19, and 25 are in either the DACS or SEL. This data was
typically analyzed and reported elsewhere (references are noted).

4.3  EXAMPLE  DATA 

The  data  collected  for  this  study  basically  is  that  set  of  data 
required  to  calculate  the  metrics  described  in  Section  3.  A 
complete  set  was  delivered  to  RADC  as  part  of  this  contract.  To 
illustrate  the  data  collected,  examples  are  provided  in  this 
section.  The  data  is  presented  by  metric  here  to  facilitate 
reference  and  correlation  to  the  validation  results  presented  in 
the  next  section. 

4.3.1  Application 

Table  4-1  provided  a  brief  description  of  each  data  source  with 
respect  to  the  type  of  system  (application  type)  represented  by 
the  data  source.  Table  4-2  presents  a  summary  of  the  fault 
density  or  failure  rate  data  collected  for  each  of  these  data 
sources.

The fault density shown is the number of failures (software
problem reports) divided by the number of executable source
lines of code that make up the software system.

TABLE 4-2. SUMMARY OF FAULT DENSITY/FAILURE RATE

In most cases, collecting this data was straightforward. Data
bases examined or articles referenced typically identified the
number of failures recorded against a system and also the size of
the system. In some cases, the failures (problem reports) and
size data were provided by module or subsystem and had to be
totalled.
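
The totalling step can be illustrated with a minimal sketch. It assumes that a system-level fault density is obtained by summing problem reports and executable lines over all modules before dividing, rather than by averaging per-module densities. The function names and the module figures are hypothetical and are not taken from any data source in this report.

```python
def fault_density(problem_reports, executable_lines):
    """Problem reports divided by executable source lines of code."""
    if executable_lines <= 0:
        raise ValueError("executable_lines must be positive")
    return problem_reports / executable_lines


def pooled_fault_density(modules):
    """Total the per-module counts first, then divide.

    modules: list of (problem_reports, executable_lines) pairs.
    """
    total_reports = sum(reports for reports, _ in modules)
    total_lines = sum(lines for _, lines in modules)
    return fault_density(total_reports, total_lines)


# Hypothetical per-module data: (problem reports, executable lines)
modules = [(12, 3000), (40, 2000), (8, 5000)]
system_density = pooled_fault_density(modules)

# For comparison only: the unweighted mean of the module densities
mean_of_densities = sum(fault_density(r, n) for r, n in modules) / len(modules)

print(system_density, mean_of_densities)
```

The two figures differ whenever module sizes differ, which is one reason the totalled (size-weighted) value is the more natural system-level measure.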

Three failure rates are shown: the average failure rate
experienced during testing of the system (i.e., the number of
failures observed divided by the total time spent testing), the
failure rate observed at the end of the test phase, and the
failure rate observed during operation of the system. The failure
rate at end of test is calculated by taking the average failure
rate observed during the last three test periods. All rates are
based on computer operational time. CPU execution time could be
used, but since it was rarely available, computer operation time
is used as a close approximation of CPU execution time. Where
available, a conversion factor is used to translate CPU execution
time to computer operational time. The table has been organized
by Application Type. An analysis of this data is presented in
Section 5.
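
The two test-phase rates described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the per-period figures are invented, and the end-of-test rate is assumed to be the failures in the last three periods divided by the hours in those periods (which equals the mean of the three per-period rates when the periods are of equal length).

```python
def average_failure_rate(failures, hours):
    """Total failures divided by total test time (failures per hour)."""
    return sum(failures) / sum(hours)


def end_of_test_failure_rate(failures, hours, window=3):
    """Failure rate over the last `window` test periods."""
    if len(failures) < window or len(hours) < window:
        raise ValueError("not enough test periods")
    return sum(failures[-window:]) / sum(hours[-window:])


# Hypothetical test history: failures and computer operating hours per period
failures_per_period = [20, 15, 10, 6, 4, 2]
hours_per_period = [200, 200, 200, 200, 200, 200]

avg_rate = average_failure_rate(failures_per_period, hours_per_period)
eot_rate = end_of_test_failure_rate(failures_per_period, hours_per_period)
print(avg_rate, eot_rate)
```

With a declining failure history such as this one, the end-of-test rate falls below the average test rate, which is the trend the table is expected to show.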

Software failure rate data is typically harder to find in
published reports or to collect. The missing element is usually
the time. At a minimum, problem reports should be dated or
operators' logs annotated when problems are encountered. Figure
4-3  is  an  example  where  the  problem  report  history  (data  source 
9)  is  time  stamped  only  by  month.  In  this  case  (data  source  9  is 
a  classified  real  time  system),  this  is  the  only  data  available 
from  this  project  except  an  estimate  that  on  the  average  200 
hours  of  computer  time  was  spent  testing  the  software  each  month. 
This  data  is  enough  to  calculate  the  failure  rate  shown  in  Table 
4-2. 

Although  Table  4-2  is  at  present  only  partially  populated,  the 
trends  within  the  columns  are  about  as  expected.  This  is 
particularly  true  for  the  end  of  test  and  operational  failure 
rates,  the  key  measures  for  this  project.  We  find  in  all  cases 
where  data  exists  for  two  or  more  of  these  columns  that  the 
failure  rate  decreases.  A  few  of  the  entries  in  Table  4-2  are 
described  in  a  little  more  detail  in  the  following  paragraphs  for 
illustration  of  the  data  calculations. 

The  Data  Sources  8  and  12  are  examples  of  airborne  applications. 
Failure  data  for  testing  was  reported  on  the  Advanced  Fighter 
Technology  Integration  (AFTI)  F-16  Program  [MACK83]  (data  source 
8).  The  failure  rate  represents  15  incidents  during  the  flight 
test  program  which  involved  approximately  180  flight  hours.  No 
record  of  failures  observed  during  the  ground  operation  or  ground 
operating  time  is  available.  Most  of  the  failures  related  to 
synchronization  provisions  between  the  triple  redundant  computers 
installed  in  the  aircraft.  Software  changes  were  used  to  correct 
the problems. It is not clear whether the failures were caused
by software deficiencies or by system deficiencies that were
overcome by program changes. Thus, the failure rate may be
overestimated.

The fault density for data source 12 is derived from two flight
control programs, consisting of approximately 40,000 lines of AED
code each [HECH83]. The individual fault densities are 0.0018
and 0.0086, respectively.

Data  Sources  5,  9,  and  14  are  examples  of  strategic  applications. 
The  fault  density  for  a  real-time  C3I  system  (data  source  5)  is 
shown  in  the  table  and  is  the  overall  fault  density  (.0085)  of 
four  subsystems,  with  individual  measures  of  0.004,  0.01,  0.01, 
and  0.02.  The  operational  software  failure  rate  is  the  six-month 
average  for  the  command  and  control  computer  associated  with  the 
large  surveillance  radar  system. 

One  real-time  operating  system  application  is  represented  as  data 
source 9 in Table 4-2. It was tested over a 25-month period for
an  average  of  200  hours  per  month  (Figure  4-3),  and  a  total  of 
270  failures  were  logged  during  that  interval,  equating  to  the 
.054  failure  rate  shown.  During  the  last  two  months  of  test  (400 
hours),  eight  failures  were  observed  (.02  failure  rate).  The 
real-time  operating  system  is  part  of  a  classified  military 
software  project. 
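
The arithmetic behind the two rates quoted for data source 9 can be checked with a short sketch. All input figures come from the paragraph above; the variable names are illustrative.

```python
months_of_test = 25
hours_per_month = 200          # estimated average computer time per month
total_failures = 270

total_test_hours = months_of_test * hours_per_month      # 5,000 hours
average_rate = total_failures / total_test_hours         # about 0.054 per hour

last_period_hours = 2 * hours_per_month                  # last two months
last_period_failures = 8
end_of_test_rate = last_period_failures / last_period_hours  # about 0.02 per hour

print(total_test_hours, round(average_rate, 3), round(end_of_test_rate, 3))
```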

The  data  for  the  electronic  switching  system  software  (data 
source 14) pertains to No. 4 ESS as reported in [DAVI81]. An
average  of  1.6  service-affecting  incidents  were  reported  per 
installation-month  during  the  first  quarter  of  1980,  and  25%  of 
these  were  attributed  to  software  (an  additional  13%  were 
unresolved).  The  entry  in  Table  4-2  assumes  that  there  were  0.5 
software  failures  during  a  720  hour  interval  (the  system  operates 
24  hours  per  day),  which  includes  an  allowance  for  the  unresolved 
incidents.  The  program  involves  over  2  million  object  words,  but 
little  is  known  about  other  characteristics.  The  electronic 
switching  systems  designed  by  Bell  Laboratories  are  recognized  as 
representing  unusually  high  hardware  and  software  reliability, 
and  hence  it  is  not  surprising  that  this  system  has  the  lowest 
operational  failure  rate. 
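
A short sketch shows how the assumed value of 0.5 software failures per installation-month relates to the reported figures. The incident count and percentages are from the paragraph above; treating 0.5 as lying between the attributed share and the attributed-plus-unresolved share is an interpretation, and the resulting hourly rate is computed here rather than quoted from Table 4-2.

```python
incidents_per_installation_month = 1.6
software_fraction = 0.25       # attributed to software
unresolved_fraction = 0.13     # cause not resolved
hours_per_month = 720          # 24 hours per day for 30 days

attributed = incidents_per_installation_month * software_fraction
unresolved = incidents_per_installation_month * unresolved_fraction

# Value adopted in the report, which allows for part of the unresolved incidents
assumed_software_failures = 0.5

failure_rate = assumed_software_failures / hours_per_month  # failures per hour

print(round(attributed, 3), round(unresolved, 3), round(failure_rate, 5))
```

Under these figures the attributed share alone is 0.4 incidents per installation-month, so the assumed 0.5 allocates roughly half of the unresolved incidents to software.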

The entries associated with data source 19 under the Tactical
Application category are four real-time C2 systems, each
involving approximately 20,000 HOL instructions for display
management and command execution (Projects 1-4 in [MUSA79]).
In computing the fault density for these systems, which were
described in [MUSA79] in lines of object code, it has been
assumed that two object instructions are equivalent to one HOL
statement. This expansion ratio was chosen on the basis of the
language and computer used for these systems. These four
projects were carried out within a single organization, and hence
it is not too surprising to find a fairly narrow spread of the
reliability indicators. The failure rate at the end of test
shows a very small range. This characteristic can in effect be
controlled by the developing organization (by holding up the
release until an acceptably low failure rate is reached).
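
The size conversion used for these systems can be sketched as below. The 2:1 expansion ratio is the one assumed in the text; the function names, the object-code size, and the fault count are hypothetical values chosen only to illustrate the calculation.

```python
OBJECT_INSTRUCTIONS_PER_HOL_STATEMENT = 2.0  # ratio assumed in the text


def hol_statements(object_instructions,
                   ratio=OBJECT_INSTRUCTIONS_PER_HOL_STATEMENT):
    """Convert a size in object instructions to equivalent HOL statements."""
    return object_instructions / ratio


def fault_density_from_object_size(faults, object_instructions,
                                   ratio=OBJECT_INSTRUCTIONS_PER_HOL_STATEMENT):
    """Fault density per HOL statement for a system sized in object code."""
    return faults / hol_statements(object_instructions, ratio)


# Hypothetical system: 40,000 object instructions and 100 recorded faults
size_hol = hol_statements(40_000)
density = fault_density_from_object_size(100, 40_000)
print(size_hol, density)
```

Because the ratio divides the size, doubling the assumed ratio doubles the computed fault density, so the choice of expansion ratio matters when comparing these systems with ones sized directly in source lines.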

Data Sets 17 and 21 are the two examples of the Process Control
Application Category. Data Set 17 is an emergency response
information system for a power plant. The fault density
represents the number of problems found in the 19,000 lines of
code developed for that system. Data Set 21 is an image
processing system of over 120,000 lines of code.

Data  Sources  6,  7,  13,  15,  and  20  are  examples  of  the  Production 
Application  category.  The  in-house  interactive  program  in  data 
source  6  supports  a  major  fiscal  agency  of  the  U.S.  Government. 
The  data  were  taken  during  the  last  half  of  1981  when  software 
outages  totaled  3,219  minutes.  From  related  reports,  the  average 
software  outage  lasted  10  minutes,  and  thus  it  was  assumed  that 
322 failures occurred. The total operating time during this
period  was  approximately  3,000  hours. 
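
The derivation of the failure count for data source 6 can be reproduced with a few lines. The outage total, average outage length, and operating time are from the paragraph above; the resulting hourly rate is computed here for illustration and is not quoted from Table 4-2.

```python
total_outage_minutes = 3219
average_outage_minutes = 10
operating_hours = 3000         # approximate, per the text

# Number of failures inferred from total outage time
estimated_failures = round(total_outage_minutes / average_outage_minutes)

# Implied operational failure rate in failures per hour
implied_rate = estimated_failures / operating_hours

print(estimated_failures, round(implied_rate, 3))
```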

The fault density and test failure rates for a scientific batch
program from the Launch Support Data Base (LSDB) program at
Vandenberg AFB [HECH77] are given as data source 7. The failure
rates were originally provided per execution-time second;
execution time has an expansion factor of approximately 10 to
wall-clock time. When this factor is applied and the seconds
converted to hours, the failure rates amount to 68 per hour
(average) and nine (9) per hour (end of test). This is very much
higher than any other data recorded in these columns. Possible
reasons for this discrepancy are:

•  The early date of these programs (coding took place in
1974 and 1975).

•  The test period included unit test, which is usually run
outside of configuration management and hence excluded
from most reported data. This affects primarily the
average test failure rate.

•  The testing reported here was followed by an acceptance
test, the results of which are not included in the data.
The end of test failure rate for the acceptance test can
be expected to be lower.
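
The unit conversion applied to the LSDB failure rates can be sketched as follows. It assumes the expansion factor means that one second of execution time corresponds to about 10 seconds of wall-clock time; the function name and the example input rate are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def per_wall_clock_hour(failures_per_cpu_second, expansion_factor=10.0):
    """Convert failures per execution-time second to failures per wall-clock hour.

    expansion_factor: wall-clock seconds elapsed per execution-time second.
    """
    failures_per_wall_second = failures_per_cpu_second / expansion_factor
    return failures_per_wall_second * SECONDS_PER_HOUR


# Hypothetical input rate, chosen only to show the scale of the conversion
example_cpu_rate = 0.2
wall_rate = per_wall_clock_hour(example_cpu_rate)
print(wall_rate)
```

Under this assumption the conversion multiplies the per-second rate by 360, so even modest rates per execution-time second become large when expressed per wall-clock hour.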

The failure rates for data source 13 are derived from System 5,
System 17, System 27, and System 40 in [MUSA79]. All of these
programs are display oriented and implement math-intensive
functions. The fault densities range from 0.0013 to 0.0025. The
failure rates range from 0.0044 to 0.13. One of these systems
involves over 2 million object instructions, but no other
software characteristics are described.

Data Set 15 contains the operational failure rate of a scientific
system based on a four-month observation of the Viking telemetry
data reduction program at the Jet Propulsion Laboratory [MAXW78].

The data for an interactive operating system (data source 20)
were derived from two large computer installations at Stanford
University, SLAC and CIT, during 1978 - 1980 [IYER81]. There is
very little year-to-year variability, and failure rates for the
two installations are also quite close (0.024 for SLAC and 0.017
for CIT in 1980). Only unique (new) problems were counted as
failures.


One of the Developmental Category data sets (data source 18) is
derived from the F-111D data reduction program reported in
[WAGO73]. These failure rates were also collected in CPU-seconds
and have much higher values in wall-clock hours. This is an even
older program than LSDB, and this may help to account for the
high failure rate. The program was developed in-house for a data
reduction task that was initially assumed to be of very limited
scope and then expanded. As is typical under those
circumstances, there are very few formal requirements, and the
extent of test is largely left up to the developer. Thus, a
higher failure rate must be expected for support programs under
these conditions.


In  Section  5  the  consolidation  of  the  fault  densities  and  failure 
rates  by  application  category  is  presented. 


Eventually we hope that enough data may be collected to bypass
the use of fault density as a reliability predictor altogether.
In that case, typical baseline failure rates achieved on actual
applications would be used. The subsequent prediction metrics
would modify this baseline failure rate up or down, much as they
are intended to do for fault density. The other benefits of
collecting the failure rate data shown in Table 4-2 are:


•  The failure rates can be used to track observed results
during a development effort. Reliability growth can be
tracked according to typical experiences. Lack of
progress can be reported to management for their action.

•  The empirical relationship between fault density and
failure rate can be derived (see Section 5).


4.3.2  Development  Environment 


This metric is concerned with the effects of the development
process that are manifest in the reliability of the software
product. Table 4-3 contains a very brief description of the
development environments for the projects being used in this
study as data sources. Not all development environments are
described. Those that are described were characterized as an
embedded (E), semi-detached (S) or organic (O) environment
according to the metric described in Section 3.


4.3.3  Software  Characteristics 


The software characteristics measurements identified in Section 3
posed a much more significant data collection challenge. To
fully satisfy the data collection requirements of many of these
measurements, detailed data had to be collected. Examples of
data collected for each measurement are provided in the following
paragraphs.


The two data sources used primarily for the detailed data
collection were the Training System (data source 10) and the
Emergency Response Information System (data source 17). These
two systems were recently delivered and are being maintained.
Key personnel involved in the developments were available for
discussions and information if necessary. Documentation and
source code were available. The following paragraphs indicate
the available sources of data for each of these two systems and
give a brief description of the system.


This system is a large, complex tactical training system. The
system supports instrumented military exercises in which the
participating units use instrumented laser weapons, and key
players and weapon systems carry transponders so that their
location and movement can be tracked by computer via a
communications network. Additionally, video and communication
data is captured. All this data is sent in real time to a
computer complex where observers sit at workstations observing
the exercise. These workstations have graphics displays on which
the exercise is shown against a terrain map background generated
from the Defense Mapping Agency digital terrain data base. The
data source is the software system that accepts this data,
displays it at observer workstations, allows the observers to
control displays, and stores the data from the complete exercise
to facilitate playback for debriefing the participants. The
major subsystems of this system are the system software, the
display subsystem and the computational component subsystem. The
system is distributed in that portions of the software run on
four VAX 11/700's and 38 workstations with LSI 11/23 processors.

The primary documentation utilized to collect data was:

•  Requirements Design Specification - Vol I

•  Requirements Design Specification - Vol II, Part A

•  Requirements Design Specification - Vol II, Part B

These  documents  represented  a  statement  of  the  requirements, 
preliminary design and detailed design of the system.
Additionally, test documentation, user documentation, and test result
documentation  were  available. 

Software  Discrepancy  Reports  were  reported  throughout  the  formal 
testing  and  operation  of  the  system.  Several  major  enhancements 
have  been  made  over  the  last  three  years.  With  each  enhancement, 
a  formal  test  and  evaluation  prooess  was  performed.  Figure  4-4 
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TABLE  4-3.  DEVELOPMENT  ENVIRONMENT.  SIZE,  AND  LANGUAGE 
CHARACTERISTICS  OP  THE  DATA  SOURCES 


Dm 

Source 

Aoolicttion 

Development  Environment 

Otic  notion 

Si/e 

Lmguige 

Dev 

Moot 

1 

Radar  C2 

Build  Procati 

Holt  —  Targat  Cron  Compilar 

Oa bugging  Package 

Simulator 

MILSTD  Development 

Librarian,  Source  Reformetter 

Data  Sat/Uiad  Cron  Reference 

Unit,  Integration.  Aceaptanca  Testing 

136.707 

JOVIAL  (64%) 
Assembly  (36%) 

E 

2 

Avionics 

No  Standard 

3  Debug  Toolt 

Hoat  —  Targat  Cron  Compiler 

Simulator 

Optimitation  Tool 

Module  Verification,  Integration, 

Syttam  Validation  Tatting 

120,000 

Fortran  133%) 
Assembly  (67%) 

E 

3 

Ground  Based  C2 

NR 

115,346 

JOVIAL 

E 

4 

C2 

Phaaad  Approach  m/Ooc  not  Followed 
Top  Oown,  Structured  Programming 
Unit,  Integration,  Acceptance  Tatting 

161.249 

Centran 

S 

C2 

No  Build  Approach 

Formal  Tatting  Throutft  Davelopmartt, 
Validation,  Aceaptanca.  Integration, 
and  Operational  T eating 

115.346 

JOVIAL 

s 

5 

C2I 

Embedded  Derelopment  Env. 

63.827 

HOL 

E 

6 

MIS  (Interactive) 

NR 

NR 

NR 

O 

7 

Scientific  (Batch) 

NR 

90,000 

Fortran 

NR 

8 

Flight  Control 

Advanced  HIM  Fault  Tolerant 
Architecture 

Top  Down  Design 

Bottom  Up  Testing 

MIL  STD  1679  Lika  Development 

NR 

NR 

6(1 1 

9 

Real-Time  OS 

NR 

NR 

NR 

NR 

10 

Training  Syttam 

Not  MIL  STO 

Structured  Approach 

Interactive  Buildl 

Programmer  Workbench 

45.702 

Fortran 

S 

11 

Mmion  Planning 

NR 

4.703 

S-  Fortran 

NR 

12 

Flight  Control 

Sami  Oeteched  Development 
Environment 

44,400 

43.500 

AED 

AED 

S 
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TABLE  4-3.  DEVELOPMENT  ENVIRONMENT,  SIZE.  AND  LANGUAGE 
CHARACTERISTICS  Of  THE  DATA  SOURCES  ICONT.) 


Dm 

Development  Environment 

Dev  1 

Sou'ce 

Application 

Oetcription 

Si/9 

Linguist 

Vose 

13 

Inter  act 

NR 

2,445,000* 

NR 

NR 

61.900* 

NR 

126,100* 

NR 

160,000* 

NR 

*Obiect 

14 

Electronic 

Switching 

NR 

NR 

NR 

— 

NR 

15 

Scientific 

NR 

NR 

NR 

NR 

16 

C* 

MILSTD  Document 

Batch  Card  Oriented 

Com  pool 

183,330 

JOVIAL 

S 

17 

RRIS 

Commercial  Development 

Modern  Tooli 

Eitennve  Acceptance  Teet 

19,690 

Fortran 

E 

18 

Support 

NR 

NR 

NR 

NR 

19 

C* 

NR 

NR 

NR 

•Obiect 

20 

Inter  active 

OS  &  Batch 

NR 

NR 

NR 

NR 

21 

Image 

Phaiad  Approach 

120.400 

Fortran 

S 

Proeeiting 

Top  Down  Deiign 

POL 

Standard!  Uied 

22 

Plight  Control 

NR 

243.883 

Fortran  192%) 

NR 

Minion 

Preparation 

Aiiembly  (8%) 

23 

Plight  Control 

NR 

46.066 

Auembly  157%) 

NR 

HMW 

JOVIAL  (39%) 

Fortran  14%) 

24 

Support 

NR 

20,618 

Aiiembly  (94%) 

NR 

Program! 

38,218 

Fortran  (6%) 

25 

Satellite  C * 

NR 

811  630 

Fortran 

NR 

10.000 

15,000 


26 


Small  Bunnell 
MIS 


fint  Proieet  Involved 
Structured  Development 
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TABLE  4-3.  DEVELOPMENT  ENVIRONMENT.  SIZE.  AND  LANGUAGE 
CHARACTERISTICS  OF  THE  DATA  SOURCES  (CONT.) 


Data     Application          Development Environment                  Size       Language          Dev
Source                        Description                                                           Mode

27       Ground Based C2      NR                                       NR         NR                NR

28       Comm. System         NR                                       NR         Assembly          a

29       Signal Processing    NR                                       NR         NR                S

30       Logistics MIS        NR                                       1,897,177  Cobol             a

31       Simulation           Incremental Development                  10,038     Fortran           O
                              Modern Design
                              Coding Standards
                              Programming Team

32       CJ                   Formal Test Approach                     96,931     JOVIAL            S
                              Formal Development Environment

33       Simulator            Formal MIL-STD Development               28,564     Fortran (39%)
                              Incremental Development                             Assembly (61%)
4.3.3.4 Modularity (SM), Complexity (SX), and Standards Review
(SR) Data Collection

These measurements required access to the source code or to a
detailed description of the software. A tool called the Metric
Information Tracking System (MITS), which is similar in function
to the Automated Measurement Tool (AMT) and the Automated
Measurement System (AMS) developed for RADC, was used. Figure
4-6 is an example of the MITS output for the elements used to
compute the Modularity, Complexity, and Standards Review metrics.
Additional source code inspection was required in some cases.
Figure 4-7 contains a composite of this data for data source 17,
provided at the CSC level. It shows the number of units contained
in each CSC (called a process in that system), the number of
executable lines of code (for Modularity), the number of branches
(for McCabe's complexity), and the other metric elements for the
Standards Review measurement. The diagonal lines separate the
raw metric score (upper left) from the calculated metric element
(lower right). The number of discrepancy reports generated
against each CSC is also indicated.

4.3.4  Test  Measurements  Data  Collection 

The three test measurements, Test Effort (TE), Test Methodology
(TM), and Test Coverage (TC), require different types of data.
The Test Effort measurement requires access to labor hour data
for the projects and a work breakdown structure accounting system
that delineates the labor expended in testing. The data used in
this study came from data sources 10 and 17 and was collected
from project management and test and evaluation management
personnel via interviews.

The Test Methodology measurement requires application of RADC TR
84-53. The handbook (Volume II) of that report contains a
methodology which, when applied, recommends testing techniques
and tools for particular applications or test objectives. The
methodology was applied to data sources 10 and 17. Table 4-4
presents the results of applying the methodology (path 1).

Test  Coverage  data  was  not  collected  on  either  of  the  two 
detailed  data  source  projects. 

4.3.5  Operational  Environment  Estimation  Measurements  Data 
Collection 

The two metrics used to describe the influence of the operational
environment on the reliability estimate, Workload (EW) and Input
Variability (EV), were also not collected on either of the two
detailed data source projects. Data was available from [IYER83].
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69.  NUMAER  OF  FROCESSING  LINES 

49.  NUMBER OF EXECUTABLE STATEMENTS

0.  INITIALIZATION  STATEMENTS 
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2.  UNCONDITIONED  GOTOS 
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TABLE  4-4  APPLICATION  OF  RADC  TR  84-53 


DATA SOURCE 10


PATH  1 
STEP  1 


TEST  CONFIDENCE  LEVEL 


COST 

CRITICALITY 
SCHEDULE 
COMPLEXITY 
DEV.  FORMALITY 
S/W CAT.

ERROR  DET. 

TEST  COMP. 


11/8 = 1.375 (1)


STEP  2 

SOFTWARE  CATEGORY  SELECTION 
SENSOR  +  SIGNAL  PROCESSING  (10)  /  DATA 
PRESENTATION  (14) 


STEP  3  CANDIDATE  TECHNIQUE  SELECTION 

•  CODE  REVIEWS 

•  ERROR  DETECTION 
STRUCTURE  ANALYSIS 

•  PROGRAM  QUALITY  ANALYSIS 
PATH  ANALYSIS 

•  DOMAIN  TESTING 
DYNAMIC  PATH  ANALYSIS 

•  PERFORMANCE  MEASUREMENT 

•  REAL  TIME  TESTING 


SELECT  TOOLS 

TEST  RESULT  ANALYZER 

*  TEST  DOC.  WRITER 

•  TEST  MANAGEMENT  SYSTEM 

*  TEST  DRIVER 

AUTOMATED  VERIFICATION  SYSTEM 

•  PERFORMANCE  MONITOR 


RATIO OF TECHNIQUES/TOOLS USED TO
RECOMMENDED:  io 

TT 


■  TECHNIQUES  OR  TOOLS  USED  ON  PROJECT 
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DATA  SOURCE  17 


TEST  CONFIDENCE  LEVEL 


14/8 = 1.75 (2)


*  CODE  REVIEWS 

*  ERROR  DETECTION 
STRUCTURE  ANALYSIS 

*  PROGRAM  QUALITY  ANALYSIS 
PATH  ANALYSIS 

*  DOMAIN  TESTING 
DYNAMIC  PATH  ANALYSIS 

*  PERFORMANCE  MEASUREMENT 

*  REAL  TIME  TESTING 
PARTICIPATION  ANALYSIS 
DATA  FLOW  GUIDED  TESTING 
ASSERTION  CHECKING 
RANDOM  TESTING 
MUTATION  TESTING 


TEST  RESULT  ANALYZER 
TEST  DOC.  WRITER 
TEST  MANAGEMENT  SYSTEM 
TEST  DRIVER 

AUTOMATED  VERIFICATION  SYS 
PERFORMANCE  MONITOR 
ASSERTION  CHECKER 
DATA  FLOW  ANALYZER 
RANDOM  TEST  GENERATOR 
MUTATION  ANALYSIS  SYSTEM 


i 
4.3.6 Test and Operational Test Time Data Collection

Data was collected to facilitate calculation of failure rate.
Table 4-5 provides data from data source 10 identifying the CPU
hours spent testing and the corresponding discrepancy reports
recorded. Data source 17 did not have this type of data recorded;
however, the system has been running 24 hours a day at the
customer site for over a year, and only 41 software discrepancy
reports have been reported over that period.
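
The failure rate calculation that this data supports can be sketched as follows. This is a minimal illustration, assuming failure rate is taken as discrepancy reports divided by operating hours. The helper name is hypothetical, and the use of exactly one year (8,760 hours) for data source 17 is an assumption: the text says only "over a year", so the result is an upper bound on the rate.

```python
def failure_rate(discrepancy_reports: int, operating_hours: float) -> float:
    """Failures per operating hour (hypothetical helper, not from the report)."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return discrepancy_reports / operating_hours


# Data source 17: 41 software discrepancy reports, running 24 hours a day.
# Assumption: exactly one year of operation; the report says "over a year",
# so the true rate is at most this value.
HOURS_PER_YEAR = 365 * 24
rate_17 = failure_rate(41, HOURS_PER_YEAR)

print(f"Data source 17: at most {rate_17:.4f} failures per hour")
```

The same helper would apply to the monthly rows of Table 4-5, using the elapsed test hours and the discrepancy reports recorded for each month.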


4.4  DATA  COLLECTION  LESSONS  LEARNED 

As  in  all  data  collection  activities,  lessons  were  learned  that 
would  have  enhanced  the  efficiency  with  which  the  data  collection 
was  performed  and  the  quality  of  the  data  collected.  Some  of  the 
specific  lessons  learned  during  this  effort  were: 

•  In a research effort such as this, there is a tendency to
want to continue refining the metrics and identifying new
ones, even after data collection activities have proceeded.
At some point in any project, even a research effort, a data
definition document should be developed which specifically
identifies the data elements to be collected. This document
should be driven by the data collection objectives or goals,
and each data element identified should be related to a
specific objective. In a research effort, other elements not
specifically related to an objective can be identified for
collection in support of future analyses that might change a
metric or create a new one.

•  A companion document to the data definition document, a
data collection guide, should also be prepared. This guide
should at a minimum:

-  Identify the sources for data collection.

-  Provide all forms and reports for data collectors.

-  Identify any data base management systems to be used
for storage of the data collected.

-  Provide a case study or example to illustrate the data
collection approach.

In addition, the guide might provide any implementation
specifics for this project, for example:

-  Programming language-specific examples, and

-  Documentation-specific examples.
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MONTH 


YEAR 


TOTAL  #  PR* 


ELAPSED  TEST 
TIME  (HOURS) 


JUNE 

JULY 

AUGUST 

SEPTEMBER 

OCTOBER 

NOVEMBER 

DECEMBER 


OCTOBER 

NOVEMBER 

DECEMBER 


JANUARY 

APRIL 

JUNE 

SEPTEMBER 

OCTOBER 

NOVEMBER 


JANUARY 

FEBRUARY 

MARCH 

APRIL 

MAY 

JUNE 

JULY 

AUGUST 

SEPTEMBER 

OCTOBER 

NOVEMBER 


01.00 
05.32
02.25 
13.12 
46.08 
67.50 


61.07 

24.42 

34.90 

NR 

NR 

20.08 

15.33 

28.87 


•  Retrieval of data from existing data bases such as the
DACS or SEL data bases is usually more time consuming than
anticipated. The data available is usually not as well
organized, cross-referenced, or defined as expected.
Therefore, this data should be depended upon only as support
or complementary data for analyses of more detailed data
collected.

•  All data collected should be stored in a centralized,
controlled data base. The data should be placed in
electronic format to facilitate later analyses and
retrieval. This format should be compatible with the
DACS.

It is recommended that future data collection activities include
the specific requirements above.
5.0  DEMONSTRATION AND VALIDATION OF
SOFTWARE RELIABILITY MEASURES


5.1  APPROACH TO THE ANALYSIS OF THE CANDIDATE SOFTWARE
RELIABILITY PREDICTION AND ESTIMATION MEASURES

The overall approach taken to analyzing the data collected is
shown in Figure 5-1. Each measurement was individually analyzed
to determine its relationship to the reliability numbers
calculated for the various data sources. In most cases an attempt
was made to hold as many other variables constant as possible
while analyzing the apparent relationship of one measurement.

The objectives of our analyses were to:

•  Determine or establish the relationship each measurement
has with the reliability numbers.

•  Demonstrate that relationship via the data sources
available during this project.

•  Statistically validate the relationship if the data sample
is sufficient.

•  Document additional data collection requirements, metrics,
or analyses that should be done.

In investigating the relationships, all appropriate past studies
were used. Simple, straightforward relationships were
investigated before more complicated ones. Thus in some cases,
recognizing that the measurement was intended to provide a simple
first-cut reliability prediction (e.g., Application Type, which
is identified via a table look-up), the simple average and
variance of the fault density experienced with each application
category were calculated. In other cases, linear regression
analysis was used to statistically determine the relationship of
the metric to the reliability numbers. In a few cases, non-linear
regression analysis was used.


5.2  ANALYSIS  OF  THE  DATA 

The analyses performed are described in the following paragraphs,
organized by measurement. Results and findings for each metric
are presented in these paragraphs. Overall results are described
in paragraph 5.3.

5.2.1  Application  Type  (A) 

All of the data sources were used in analyzing the Application
Type. The goals of this analysis were to establish baselines and
provide an initial reliability prediction number. This initial
prediction number could be viewed as an industry average or
baseline for the particular application. Table 5-1 provides
averages for each sample by Application Type. This table is a
summarization of Table 4-2. Indicated in the table is the number
of systems for which data was collected for that Application
Type. The total number of systems in the data base was 59. Of
these 59, the number of source lines of code was reported for 49,
amounting to over 5 million lines of code. The average fault
density indicated is a weighted average, i.e., the total number
of errors found divided by the total number of lines of code for
all systems in that application category. The fault density by
system is an average of the fault densities reported for each
system, i.e., system size is not taken into account. A standard
deviation for the average fault density by system is given in
parentheses. The failure rates shown are the average failure
rate during formal testing, the failure rate at the end of the
test period, and the operational failure rate. The failure rate
is in units of failures per computer operation hour.
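
The difference between the two fault density averages can be illustrated with a short sketch. The system sizes and fault counts below are made-up values chosen only to show the effect; they are not taken from Table 4-2. The use of the sample standard deviation is also an assumption, since the report does not say which form it used.

```python
from statistics import mean, stdev

# (faults, lines_of_code) per system; illustrative values, not report data.
systems = [(100, 10_000), (200, 200_000)]


def weighted_fault_density(systems):
    """Total faults divided by total lines of code across all systems."""
    total_faults = sum(faults for faults, _ in systems)
    total_loc = sum(loc for _, loc in systems)
    return total_faults / total_loc


def fault_density_by_system(systems):
    """Mean and sample standard deviation of the per-system fault densities."""
    densities = [faults / loc for faults, loc in systems]
    return mean(densities), stdev(densities)


weighted = weighted_fault_density(systems)
by_system, spread = fault_density_by_system(systems)

print(f"weighted average:  {weighted:.5f}")
print(f"average by system: {by_system:.5f} ({spread:.5f})")
```

With these inputs the small, fault-dense system pulls the by-system average well above the weighted average, which is why the two figures are reported separately.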

The airborne applications consisted of eight different data
sources (systems). One large system written primarily in
assembly language in the early 1970s (data source 2 - [FISH79])
had a reported fault density of .017. Two others written in AED
(each approximately 40,000 lines of code) were real-time
closed-loop flight control systems and reported fault densities
of .0086 and .0018 [HECH83]. Four others were flight control
programs on board the ALCM or B-1B [HECH83] and had reported
fault densities of .0029, .011, .021, and .027. A last system,
the digital flight control system on the Advanced Fighter
Technology Integration (AFTI) F-16 program, reported a .08
failure rate (.08 failures per operational flight hour) during
flight testing.

The strategic systems data consists of 25 different systems. Most
of these systems are military C3I systems, ground-based C2
systems, NASA ground stations, or communication switching
systems. The fault densities reported ranged from .0001 to .054
and the failure rates from .0007 to .028. The latter failure rate
(.0007) belonged to the most reliable system reported in the data
base (data source 14). Many of the systems in this application
category were of significant size, over 100,000 lines of code.

The tactical systems data consists of 5 systems. These ranged
from four command and control applications (data source 19) to a
tactical training system (data source 10). The four C2 projects
each involved between 10,000 and 20,000 HOL instructions
performing display management and command execution in a command
and control system (Projects 1 - 4 in [MUSA79]). Individual data
for these projects are presented in Table 4-2. The fault density
entry in Table 5-1 is an average of these four plus the other
tactical system. These four projects were carried out within a
single organization, and hence it is not surprising to find a
fairly narrow spread of the reliability indicators. The training
system (data source 10) was described in Section 4. It has all
of the ingredients of an operational tactical system. Its
reported fault density was .0016. Failure rate data was also
captured for this system: 1.04 average during testing, .63 at
end of test, and .18 during operation.

The  Process  Control  Application  Type  was  only  represented  by  two 
data  sources.  This  Application  Type  was  created  to  distinguish 
between  the  critical  nature  of  the  airborne,  strategic  and 
tactical  applications  and  the  production  center  and  developmental 
applications.  It  represents  some  aspects  of  each  of  the  above 
two  groups.  The  two  systems  used  were  an  Emergency  Response 
Information  System  (data  source  17),  described  in  Section  4,  and 
an  Image  Processing  System  (data  source  21).  Fault  densities 
reported  for  these  two  were  .002  and  .0016  respectively. 

The Production Systems category was represented by fourteen data
sources. These ranged from an interactive operating system at a
university (data source 20) to interactive commercial and
military systems (data source 13), an in-house system running
financial management systems (data source 6), a Launch Support
Data Base program at Vandenberg AFB (data source 7), and
telemetry processing for the Viking Project at JPL (data source
15). These systems ranged in size from 10,000 lines of code to
one system that was 1,697,177 lines of code. About half of these
systems were interactive, transaction processing systems while
the other half were batch processing systems.

The Developmental Systems are represented by five systems. One
is data source 18, which is a data reduction system, and two are
the support programs described in [HECH83] (data source 24). The
two other systems (data sources 31 and 33) are simulators. The
failure rates reported on data source 18 were very high (170 for
test average and 21 for end of test). This is the only failure
rate data reported for this category, so the average may be
biased high.

Table 5-1 illustrates the improvement in reliability expected as
the failure rate moves from the test average to the end of test
and into operation.

The data collected exhibits, on the average, a ratio of
approximately 9 to 1 between the average failure rate during test
and the failure rate observed at the end of test, and a ratio of
approximately 7 to 1 between the failure rate at the end of test
and the operational failure rate (see Table 5-2). The averages
are calculated from Table 4-2 for those data sources where
failure rates are reported for each of these pairwise
comparisons. The range in the ratios of average failure rate
during test to end of test failure rate is 1.7:1 to 41.2:1. If
the one system that exhibited the 41.2:1 ratio is eliminated,
then the average ratio is 5:1 with a range between 1.7:1 and
8.9:1. The range in the ratios of end of test failure rate to
operational failure rate is 2.5:1 to 11:1, with a calculated
average of 7:1. These ratios are potentially valuable estimation
parameters that allow rule of thumb estimates of the failure
rates to be expected at end of test or during operation based on
the observed average failure rate during testing. Data is needed
for the Airborne and Process Control categories to complete this
table.
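
A rule of thumb estimate of this kind can be sketched as below. The function and constant names are hypothetical; the default ratios are the approximate averages quoted in the text (9:1 and 7:1), and a caller could substitute the 5:1 figure obtained when the 41.2:1 outlier is excluded. The observed values are the data source 10 failure rates reported earlier in this section.

```python
# Approximate average ratios quoted in the text (see Table 5-2).
TEST_AVG_TO_END_OF_TEST = 9.0
END_OF_TEST_TO_OPERATION = 7.0


def estimate_end_of_test(avg_test_rate, ratio=TEST_AVG_TO_END_OF_TEST):
    """Rule of thumb end of test failure rate from the test average."""
    return avg_test_rate / ratio


def estimate_operational(end_of_test_rate, ratio=END_OF_TEST_TO_OPERATION):
    """Rule of thumb operational failure rate from the end of test rate."""
    return end_of_test_rate / ratio


# Data source 10 as reported: 1.04 test average, .63 end of test,
# .18 during operation (failures per computer operation hour).
observed_test_to_end = 1.04 / 0.63
observed_end_to_op = 0.63 / 0.18

print(f"data source 10 ratios: {observed_test_to_end:.1f}:1 and "
      f"{observed_end_to_op:.1f}:1")
print(f"estimated end of test rate from 1.04: {estimate_end_of_test(1.04):.3f}")
```

Data source 10 sits at the low end of both reported ranges, which shows how far an individual system can fall from the average ratios.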

Another relationship which we had hoped to observe was specific
differences in either fault density or failure rate among the
Application Categories. Table 5-1 shows that the Airborne and
Strategic categories exhibited the same average fault density
(.009), the Developmental category exhibited the highest average
fault density (.011), the Process Control category exhibited the
lowest (.0017), and the Production Systems and Tactical
categories exhibited fault densities of .0036 and .0027
respectively. Additional data sources in the Process Control
category are needed to confirm it as having the lowest fault
density. Our expectation that the highly critical systems
(airborne, strategic, and to some degree tactical) would exhibit
lower fault densities than other categories was not met.

Our expectations were consistent with the findings in the
observed failure rates. The Strategic category had an average
failure rate of .0108 during operation. The Airborne category had
failure rate data available from only one data source, and it
was an average during test. It was .08, which was significantly
lower than the .34 average test failure rate exhibited by the
strategic systems. Thus we could expect a better operational
failure rate for the airborne systems. The tactical system
operational failure rate (.108) was next in the expected
hierarchy of failure rates. The Production Systems category,
with a failure rate of .198, was next, and the Developmental
Systems (a failure rate of 21) were last, using the end of test
failure rate reported for one data source.

These differences are further illustrated if failure rates are
calculated for each data source in Table 4-2 for which failure
rates for end of test or operations were reported. Averages for
each application category calculated this way are shown in Table
5-3. In this table, the categorization scheme recommended by
Hecht, based on the processing time constraints of the systems,
is also shown. Using this scheme, clear differences in the
failure rates observed are exhibited. The real time applications
had an average failure rate of .0048, the on-line (interactive,
transaction processing) applications had an average of .016, the
batch process applications had an average of .02, and the one
developmental support application had an average of 21. This
categorization scheme seems most promising.

Figures 5-2a, b, c, and d present the data in Tables 5-1, 5-2,
and 5-3 graphically. Two general phenomena are observed. One is
that the reliability of the more time critical systems is higher
than that of less time critical systems (Figure 5-2c). The same
may hold for the more functionally critical systems having the
higher reliability (Figure 5-2b), but more data is required.

The other phenomenon is the reliability growth illustrated
through the test phase into operations (Figure 5-2d). All
failure rates are in failures per Computer Operation Hour (COH).

TABLE 5-3 FAILURE RATE BY APPLICATION CATEGORY

An expected relationship between fault density and application
type was not borne out by the data (Figure 5-4a). It appears
that the more critical systems, which are typically developed
with more formality, still exhibit approximately the same fault
densities as the non-critical systems. This probably happens
because they are subjected to more formal testing. The
differences show up once the system is fielded, when the critical
systems exhibit the lower failure rate since most of their faults
have been removed. The non-critical systems still contain many
faults and have higher failure rates.

The  basic  purpose  of  these  analyses  was  to  develop  an  initial  set 
of  baselines,  which  are  in  Table  5-1  and  Table  5-2. 

5.2.2  Development  Environment  (D) 

As previously discussed, the development environment as well as
the software implementation are viewed as contributors to the
fault density and are evaluated primarily against that measure.
To establish the prediction factors for the development
environment, two approaches are available:

•  Gross statistics: determine the fault density of many
software projects in each class; and

•  Selective comparison: determine the fault density of
comparable projects in each class.

Figure 5-3 illustrates the data available from the data sources
relating the Development Mode metric to fault density. Note that
within each category of Development Mode there is a scale. This
scale represents the rating derived from the checklist described
in Section 3 (Table 3-7). That checklist identifies which
techniques and tools were employed during the development. The
rating is the number of items checked divided by the total
number of items, i.e., if 15 items are checked of the total 30,
the rating is .5. From the limited data available, there appears
to be a relationship which is intuitively supported: the more
formal tools and techniques employed, the more faults found
during the development phase. The relationships exhibited by the
data in Figure 5-3 are:

FD = .109d - .04 for Embedded

FD = -.006d + .009 for Semi-detached

FD = -.018d - .003 for Organic

where d is the rating of the development approach using the
checklist (Table 3-7).
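
The rating and its use in one of these relationships can be sketched as follows. This is an illustration only: the function names are hypothetical, and only the Semi-detached relationship is coded here, read as FD = -.006d + .009. As the text notes, these fits are not statistically significant.

```python
def checklist_rating(items_checked: int, total_items: int) -> float:
    """Development environment rating d: items checked / total items."""
    if total_items <= 0 or not 0 <= items_checked <= total_items:
        raise ValueError("need 0 <= items_checked <= total_items, total_items > 0")
    return items_checked / total_items


def semi_detached_fault_density(d: float) -> float:
    """Fault density from the Semi-detached fit, read as FD = -.006d + .009."""
    return -0.006 * d + 0.009


# Example: 15 of 30 checklist items apply, so d = 0.5.
d = checklist_rating(15, 30)
fd = semi_detached_fault_density(d)

print(f"d = {d:.2f}, predicted fault density = {fd:.4f}")
```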

These relationships result from the gross statistics approach.
To have confidence in these relationships, data from a
significant number of projects (approximately 30 in each
category) would have to be gathered. The current correlations
are not statistically significant but do exhibit an intuitive
relationship. Figures 5-4, 5-5, and 5-6 illustrate the
relationships.

Selective comparisons were also made to assess whether more
insight could be gained into the effect of the development mode
on software reliability.

One such comparison will be used as an example. An organic
environment is represented by the real-time flight control
program listed as data source 12 in Table 4-2. The flight
control software represented by this data was produced by a
group within the flight control equipment manufacturer's
organization having considerable familiarity with the
application. The real-time command and control software
represented by data source 5 in Table 4-2, in comparison, was
produced in an embedded environment. Both software products
involve approximately 40,000 lines of code, run under tight
timing constraints, and incorporate modern programming practices.

The  fault  densities  for  these  two  examples  are: 

•  Organic  environment  —  0.005 

•  Embedded  environment  —  0.0085 

If the observations reported here carry through for a larger
sample, the embedded environment will then be assigned a fault
density multiplier that is 0.0085/0.005 = 1.7 times that of the
organic environment. Since it is desired to have the unity value
of the parameter for a neutral environment, the organic
development environment will be assigned a value of 0.76 and the
embedded environment a value of 1.3, the ratio of these being
1.7. As a check, the average fault density for the embedded data
sources used in Figure 5-3 is .014 and for the organic data
sources .0082, which is consistent with the 1.7 ratio
(.014/.0082) calculated above. These summary relationships
between the development modes will be used to establish a basic
multiplier for the development environment metric. This
multiplier will be modified if information is available to
complete the checklist. In this case, the equations presented
earlier are used.
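
The arithmetic behind these multipliers can be checked with a short sketch. The report does not state how the values 0.76 and 1.3 were derived; the symmetric split about unity used below (square root of the ratio and its reciprocal) is an assumption that happens to give values close to those quoted.

```python
import math

# Fault densities from the selective comparison above.
organic_fd = 0.005
embedded_fd = 0.0085

ratio = embedded_fd / organic_fd  # about 1.7

# Assumption: place the two multipliers symmetrically about 1.0 on a
# multiplicative scale, so that embedded / organic equals the ratio.
embedded_multiplier = math.sqrt(ratio)       # about 1.30
organic_multiplier = 1.0 / math.sqrt(ratio)  # about 0.77

# Cross-check against the gross averages quoted for Figure 5-3.
check_ratio = 0.014 / 0.0082                 # about 1.7

print(f"ratio = {ratio:.2f}, check = {check_ratio:.2f}")
print(f"embedded = {embedded_multiplier:.2f}, organic = {organic_multiplier:.2f}")
```

Under this assumption the product of the two multipliers is exactly 1, which is one way to read "unity value for a neutral environment".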

5.2.3  Software  Characteristics 

Each of the metrics described in Section 4 was analyzed against
the fault density data collected. Some of these metrics were
analyzed at the system or subsystem level, others at the CSC or
unit level. Where the analyses were performed at the CSC or unit
level, data sources 10 and 17 were used.
5.2.3.1 Anomaly Management, Traceability, and Quality Review

The Anomaly Management metric and Quality Review metric scores as
applied to data source 10 are in Table 5-4. These metrics were
applied at a CSC (process) level since the design documentation
was written with that orientation. The results of the
statistical analysis of these scores versus the fault density
recorded are in Figures 5-7 and 5-8. As can be seen, neither
analysis provided significant results, i.e., results that could
be used for prediction. Both metrics demonstrated a correlation
with fault density, i.e., as the metric score went up, the fault
density went down, but the relationship was not statistically
significant. The Quality Review results were disappointing. The
results were expected to support Lipow's findings in [LIPO79],
where units which had many design problems were also the ones
that had the most implementation problems.

Further  investigation  revealed  the  following: 

•  Processes with an AM score greater than .6 had a fault
density of .0008.

•  Processes with an AM score between .4 and .6 had a
fault density of .001.

•  Processes with an AM score less than .4 had a fault
density of .004.

This analysis lends itself to developing a metric with a
multiplier based on the above findings. A conservative approach
will be taken, assigning a multiplier of .9 for an AM score
greater than .6, 1 for an AM score between .4 and .6, and 1.1 for
a score less than .4. A similar relationship was found with the
Quality Review metric. Using a QR score of .5 as a divider,
higher QR scores had an average fault density of .0007 and lower
QR scores had an average fault density of .0016. Again taking a
conservative approach, a multiplier of 1.1 was assigned to SQ if
the metric score was lower than .5.
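
The multiplier assignment just described amounts to a simple lookup, sketched below. The function names are hypothetical. Two details are assumptions not stated in the text: scores of exactly .4 and .6 are treated as belonging to the middle band, and a Quality Review score of .5 or higher is given a neutral multiplier of 1.

```python
def anomaly_management_multiplier(am_score: float) -> float:
    """Multiplier for the Anomaly Management (AM) score."""
    if am_score > 0.6:
        return 0.9
    if am_score >= 0.4:   # assumption: boundary values fall in the middle band
        return 1.0
    return 1.1


def quality_review_multiplier(qr_score: float) -> float:
    """Multiplier for the Quality Review (QR) score."""
    # Assumption: scores of .5 or higher are neutral (multiplier of 1).
    return 1.1 if qr_score < 0.5 else 1.0


for am in (0.8, 0.5, 0.2):
    print(f"AM score {am}: multiplier {anomaly_management_multiplier(am)}")
```

The observed fault densities (.0008, .001, and .004) differ by a factor of five across the AM bands, so a spread of .9 to 1.1 is deliberately much narrower than the data alone would suggest.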

An attempt was made to assess traceability. Without the use of a
formal requirements specification language such as PSL/PSA or
SREM, or a significant expenditure of labor to establish a
traceability matrix utilizing a tool such as RTT, this was very
difficult to do within the scope of this project for systems as
large as data sources 10 and 17.

Additional analyses are needed to establish whether these metrics
can be used as predictors. See Section 7 for recommendations and
plans.

9. 2. 3. 2  Software  Implementation  Characteristics 

Table 5-5 contains a summarization of the data collected from
data sources 10 and 17 to analyze the software implementation
characteristics. Data Source 10 CSCs are identified by
Processes 101-307. Data Source 17 CSCs are identified by
Processes 401-409. The following paragraphs describe our
analyses for each metric.

[Table: PROCESS | ANOMALY MGMT | QUALITY REVIEW | #PR | FD]

* Process not available at design
** Process either deleted or combined with other
processes in implementation

Language 

The Language metric was evaluated in [HECH83] for a significant
sample of programs. Typical data from that study are shown in
Table 5-6. For post-1977 programs, the average fault density of
assembly programs was found to be .0103 and that of HOL programs
was found to be .0075 (both are here expressed as a ratio of
faults to source statements, whereas in the reference they are
given as percentages of equivalent assembly statements). If HOL
is used as the baseline (metric = 1), assembly language code
therefore carries a multiplier of .0103/.0075 = 1.4.


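TABLE 5-6.

EFFECT OF LANGUAGE ON RECENT PROGRAMS

+--------------------------+---------------+---------------+
| Program Attribute        | Assembly      | HOL           |
+--------------------------+---------------+---------------+
| Number of Programs       | 6             | 15            |
| Program Size*            | 100K          | 1,124K        |
| Average Fault Density**  | .0103         | .0075         |
| Range of Fault Density   | .0015 - .0521 | .0001 - .0086 |
+--------------------------+---------------+---------------+

*  Equivalent executable assembly statements

** Fault density = No. of faults per line of executable code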


Most of the High Order Language (HOL) programs included in this
sample were written in FORTRAN. Two programs were written in the
AED programming language, generally considered to represent a
more primitive type of HOL, and these had an average fault
density of .0052. Because of the small size of that sample, it
may be premature to establish a differentiation based on the type
of HOL in which the program is implemented. None of the programs
in that sample were written in a block-structured HOL. PASCAL
and Ada programs should be examined to determine whether their
reliability attributes differ significantly from those of
FORTRAN programs.

For earlier programs, the following fault densities in percent
are reported in [NELS78]:

FORTRAN (18)     .0151
COBOL (9)        .0129
PL/1 (2)         .0333
CENTRAN (3)      .0194
Assembly (24)    .0266

The  number  of  programs  involved  is  indicated  in  parentheses  after 
each  language.  The  unweighted  average  fault  density  of  the  four 
high  order  languages  is  .0202;  the  average  weighted  by  the  number 
of  programs  involved  is  .016.  The  ratio  of  assembly  to  HOL  fault 
densities  is  1.3  and  1.6,  depending  on  the  method  of  averaging. 
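The two averaging methods mentioned above can be checked with a short calculation. The sketch below uses only the [NELS78] figures quoted in the text; the variable and function names are illustrative choices, not part of the original study.

```python
# (number of programs, fault density) per high order language, from [NELS78].
hol = {
    "FORTRAN": (18, 0.0151),
    "COBOL": (9, 0.0129),
    "PL/1": (2, 0.0333),
    "CENTRAN": (3, 0.0194),
}
assembly_fd = 0.0266


def unweighted_mean(data):
    """Average the per-language fault densities, one vote per language."""
    return sum(fd for _, fd in data.values()) / len(data)


def weighted_mean(data):
    """Average the fault densities weighted by the number of programs."""
    total_programs = sum(n for n, _ in data.values())
    return sum(n * fd for n, fd in data.values()) / total_programs


unweighted = unweighted_mean(hol)  # close to .0202
weighted = weighted_mean(hol)      # close to .016

ratio_unweighted = assembly_fd / unweighted  # close to 1.3
ratio_weighted = assembly_fd / weighted      # between 1.6 and 1.7

print(unweighted, weighted, ratio_unweighted, ratio_weighted)
```

The weighted average is pulled toward FORTRAN and COBOL, which account for 27 of the 32 HOL programs, and this is why the two ratios differ.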

Using  fifteen  more  projects  from  the  current  data  base  that  were 
implemented  in  a  single  language  each,  the  following  additional 
fault  densities  are  reported: 


FORTRAN (6)      .017
JOVIAL (2)       .001
COBOL (1)        .0012
C (4)            .0085
AED (2)          .005
ASSEMBLY (4)     .0148

Again,  calculating  the  average  HOL  fault  density  to  be  .0114  and 
dividing  this  into  the  Assembly  language  fault  density  (.0146),  a 
ratio  of  1.3  is  derived.  This  is  in  very  good  agreement  with  the 
findings  reported  above  and  indicates  that  the  multiplier  for 
assembly  language  is  reasonably  firm. 

Reuse 

The extent of prior use is documented for many programs in the
Goddard-SEL data base. Table 5-7 lists the percentage of re-used
and modified lines of code of programs for which the fault
density had been computed in [HECH83]. These programs were
developed in a reasonably uniform environment between 1977 and
1980. They comprise from 14,000 to 200,000 executable
statements. The primary language is FORTRAN, with assembly
segments that range from 13% to 28% of the code.

Two analyses were conducted on this data sample. The first one
considered only the percentage of re-used code and resulted in
the following findings (Table 5-8):

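TABLE 5-8

PRIOR USE OF CODE FOR SELECTED SEL PROGRAMS

+----------+---------+---------------+----------+
| Percent  | No. of  | Avg. Fault    | Weighted |
| Re-used  | Systems | Density       | Avg. FD  |
|          |         | by System     |          |
+----------+---------+---------------+----------+
| < 10     | 2       | 0.00215       | .00058   |
| 10 - 20  | 3       | 0.0012        | .00125   |
| > 20     | 4       | 0.0011        | .00068   |
+----------+---------+---------------+----------+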

The  second  analysis  considered  re-used  code  and  50%  of  the 
modified  code  (together  termed  Re/Mod  Code)  and  yielded  the 
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following  results  (Table  5-9): 


TABLE 5-9

REUSED AND MODIFIED CODE IMPACT ON FAULT DENSITY

+----------+---------+---------------+----------+
| Percent  | No. of  | Avg. Fault    | Weighted |
| Re-used  | Systems | Density       | Avg. FD  |
|          |         | by System     |          |
+----------+---------+---------------+----------+
| < 15     | 1       | 0.0042        | .0042    |
| 15 - 30  | 2       | 0.0003        | .0003    |
| > 30     | 6       | 0.00125       | .0012    |
+----------+---------+---------------+----------+

Neither analysis found a conclusive relationship between
fault density and re-used code. From the limited data currently
available, no predictive relationship could be developed. Other
programming environments need to be explored in order to assess
whether a representative and accurate predictor can be developed.

Size  of  Code 

Comparisons of fault density for programs of different size are
currently available from three sources: [HECH83], [NELS78], and
this study. The first includes 16 programs (at least 75% of
each coded in HOL), all of which were developed between 1978 and
1980 in a disciplined programming environment; [NELS78] comprises
52 programs developed prior to 1977 in a variety of languages
(including many assembly programs) and programming practices.
This study includes most of the systems in [HECH83] plus
additional ones. The effect of size on fault density is shown in
Table 5-10. The data collected during this study are portrayed
graphically in Figure 5-9.


TABLE 5-10.

EFFECT OF SIZE OF CODE

                         Fault Density, Percent, by Source
+----------------------+---------+----------+------------+
| Program Size (DSLOC) | HECH83  | NELS78   | This Study |
+----------------------+---------+----------+------------+
| < 10K                | .001*   | .034     | .054*      |
| 10K - 49.9K          | .0036   | .0084    | .0074      |
| 50K - 99.9K          | .0021*  | .0087**  | .0195      |
| > 100K               | .001    | .0124    | .0088      |
+----------------------+---------+----------+------------+

*  Class comprises a single program
** Excluding one program at .14.

The overall trend seems to indicate that large programs have a
lower fault density than small ones, which is counter-intuitive.
Possible explanations are a greater amount of re-used code in
large programs and a more disciplined programming environment.
In the NELS78 data set, it is quite likely that the large
programs made more use of HOLs.

Figure  5-9  could  be  misleading  because  of  the  two  extremely  large 
systems.  Figure  5-10  is  a  regression  using  the  same  data  except 
those  two  large  systems.  This  figure  shows  even  less  correlation 
and  highlights  the  fact  that  size  does  not  appear  to  be  related 
to  the  fault  density. 

At a CSC level within a system, the relationship is more
consistent with expectations. Figure 5-11 illustrates the
correlation found between size and fault density in data source
10, where the sizes of CSCs are plotted.


The effect of module size on fault density has been evaluated on
the basis of data from data sources 1, 4, 10, 11, 17, 21, and 29.
Data source 1 is predominantly written in JOVIAL/J3 and was
tested over a three year period that ended prior to mid-1977.
Thus, program development is presumed to have started prior to
1974. No structured design was involved. The average fault
density for module size classes is shown in Table 5-11. Size is
expressed in source code statements.


TABLE  5-11. 

EFFECT  OF  MODULE  SIZE: 
DATA  SOURCE  1 
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FIGURE  5-10  MODIFIED  ANALYSIS 
OF  SIZE  VS  FAULT  DENSITY 
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Data  collected  during  this  effort  is  more  intuitively  supportive. 
Data collected from data sources 10 and 17 is in Table 5-12.
Here units which were under 200 lines of code performed extremely
well.


TABLE 5-12.

EFFECT OF MODULE SIZE:
DATA SOURCE 10, 17

+-----------+----------------------------+---------------+
| No. of    |                            |               |
| Processes | Executable Statements/Unit | Fault Density |
+-----------+----------------------------+---------------+
| 3         | < 50                       | 0             |
| 3         | 50 to 100                  | 0             |
| 9         | 101 to 200                 | 0             |
+-----------+----------------------------+---------------+

[Remaining rows only partly legible: 15 TOTAL; < 999; .0017; .0014;
29 TOTAL; > 2000; .0015]

Data  available  from  data  source  11  is  shown  in  Table  5-13. 

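TABLE 5-13.

EFFECT OF MODULE SIZE:
DATA SOURCE 11

+-------------------+----------------+---------------+
| Statements/Module | No. of Modules | Fault Density |
+-------------------+----------------+---------------+
| < 100             | 23             | .094          |
| 100 - 1,000       | 4              | .044          |
| > 1,000           | 1              | .047          |
+-------------------+----------------+---------------+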

In [GRAS82], a relationship between module size and number of
problem reports was found to be:

PR's = .012 S - 9.3

where PR's = Number of Problem Reports
and S = Number of Lines of Code

In  [MOTL76],  the  relationships  shown  in  Figure  5-12  were 
developed. 

The  obvious  conclusion  is  that  no  consistent  relationship  could 
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SOURCE  INSTRUCTIONS  PER  PROGRAM 


FIGURE  5-12  ERROR  RATE  AND  SOURCE  INSTRUCTION 
RELATIONSHIP  FOR  [MOTL  76] 
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few  modules,  does  the  fault  density  exhibit  the  expected  relation 
to  complexity.  In  all  other  subclasses,  the  effect  of  complexity 
(as  assessed  here)  on  fault  density  seems  to  be  random. 

A subjective evaluation of complexity as "easy", "medium", and
"hard" is also provided in the SEL Component Summary Form, but no
analysis of that information relative to fault density was
performed since it was assumed it would not provide conclusive
data.

Using the data collected in Table 5-4 for data sources 10 and
17 to calculate a complexity metric based on the McCabe
cyclomatic complexity metric, and relating that metric to fault
density, gave better results. Figure 5-13 illustrates the
results of the regression analysis using the McCabe complexity
metric for data sources 10 and 17. The relationship illustrated
here is:


FD = -.009 C + .001


The negative slope is consistent with the way we have defined the
complexity metric, i.e., as the metric approaches zero, complexity
increases. The correlation coefficient is not supportive of
using the above relationship generally. What is apparent from
the plot of data, however, is that the processes with a McCabe's
metric less than .05 (which is a cyclomatic complexity greater
than 20) are more likely to be those processes with a higher fault
density. Based on this observation, a multiplier of 1.5 is
recommended for modules with a complexity greater than 20, 1 for
modules with a complexity between 7 and 20, and .8 for those
modules with a complexity less than 7. The overall multiplier
will be a weighted average of those scores by the number of
modules in each category.
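The weighted average described above can be sketched as follows. The function name and the sample complexities are hypothetical, and modules with a complexity of exactly 7 or exactly 20 are assumed to fall in the middle category, since the text does not state where the boundaries belong.

```python
def complexity_multiplier(cyclomatic_complexities):
    """Weighted-average complexity multiplier over a set of modules.

    Each module contributes 1.5 (complexity > 20), 1.0 (7 through 20,
    boundaries assumed inclusive), or 0.8 (complexity < 7).
    """
    n = len(cyclomatic_complexities)
    if n == 0:
        raise ValueError("at least one module is required")
    a = sum(1 for c in cyclomatic_complexities if c > 20)  # high complexity
    c_low = sum(1 for c in cyclomatic_complexities if c < 7)  # low complexity
    b = n - a - c_low  # everything in between
    return (1.5 * a + 1.0 * b + 0.8 * c_low) / n


# Hypothetical system: one complex module, two moderate, one simple.
print(complexity_multiplier([25, 10, 10, 3]))
```

Dividing by the module count keeps the result between .8 and 1.5, so a system made up entirely of moderately complex modules keeps a neutral multiplier of 1.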

Standards Review


The  Standards  Review  represents  code  inspections,  walkthroughs 
or  standard  enforcement  results.  In  Table  5-4  there  are  a  number 
of  data  elements  which  make  up  the  Standards  Review  Checklist 
described  in  Volume  II.  Figures  5-14  through  5-19  illustrate  the 
correlations  found  between  various  measurements/elements  and  the 
number  of  problems  found  in  a  process.  The  ones  illustrated  in 
these  figures  are: 


Figure  5-14: 

Figure  5-15: 

Figure  5-16: 
Figure  5-17: 
Figure  5-18: 
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Figure  5-20  illustrates  a  non-linear  regression  analysis.  This 
is  the  same  metric  and  data  as  shown  in  Figure  5-19.  The 
non-linear regression analysis resulted in a slightly better fit.

The  regressions  were  calculated  using  number  of  problem  reports 
and  fault  density.  Better  correlations  were  found  with  number  of 
problem  reports  as  the  independent  variable  and  those  analyses 
are  presented  here.  We  found  in  data  sources  10  and  17  that  over 
60  percent  of  the  processes  (CSC's)  had  no  problems  reported 
against  them.  Only  15  percent  had  more  than  3  problems  written 
against  them  which  based  on  the  average  size  of  a  process  equated 
to  a  fault  density  greater  than  .0015. 

A key use of these metrics for improving software reliability is
to pinpoint these problem modules, for predictive purposes but
primarily for identification and correction. As an illustration
of this concept, using the metric number of data items to
identify the potential problem modules, we flagged all processes
that have more than the average number of data items (997). In
retrospect, this technique would have identified 86 percent of
the problem modules. The identification is not perfect, i.e.,
other modules were also identified by the metric that were not
problem modules by our definition. But the predictive
performance seems excellent. The results were:

• 42% of all processes flagged

• 84% of processes flagged had problems

• Identified 88% of all processes with problems

• Identified 86% of all problem processes (those with
fault densities higher than the average for the
overall system).
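The flagging technique and its summary percentages can be sketched as below. The process names and counts are hypothetical and are not taken from data sources 10 and 17; only the rule itself (flag every process whose number of data items exceeds the average) comes from the text.

```python
def flagging_summary(data_items, has_problem):
    """Flag processes above the mean data-item count and score the result.

    data_items:  process name -> number of data items
    has_problem: process name -> True if problems were reported against it
    """
    threshold = sum(data_items.values()) / len(data_items)
    flagged = {p for p, n in data_items.items() if n > threshold}
    with_problems = {p for p, bad in has_problem.items() if bad}
    hits = flagged & with_problems
    return {
        "threshold": threshold,
        "pct_flagged": 100 * len(flagged) / len(data_items),
        "pct_flagged_with_problems":
            100 * len(hits) / len(flagged) if flagged else 0.0,
        "pct_problems_identified":
            100 * len(hits) / len(with_problems) if with_problems else 0.0,
    }


# Hypothetical processes, for illustration only.
data_items = {"P1": 1500, "P2": 300, "P3": 1200, "P4": 200, "P5": 800}
has_problem = {"P1": True, "P2": False, "P3": True, "P4": False, "P5": True}
summary = flagging_summary(data_items, has_problem)
print(summary)
```

The three percentages correspond to the first three bullets above: the share of processes flagged, the share of flagged processes that had problems, and the share of processes with problems that the flag caught.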

For purposes of prediction, the metric recommended is based on
the  percentage  of  problem  modules  identified  by  the  metrics.  If 
over  half  of  the  modules  are  flagged  as  potential  problem  modules 
by  the  metrics  applied  as  a  standards  review  then  the  predicted 
reliability  should  be  raised  since  the  expected  problems  seem 
manageable.  In  data  sources  10  and  17,  the  problem  processes  had 
a  fault  density  of  .0035,  twice  the  average  fault  density  of  the 
system,  .0017.  These  problem  processes  accounted  for  15  percent 
of  the  processes.  Thirty-eight  (38)  percent  of  the  processes  had 
problems  with  an  average  fault  density  of  .0024,  1.4  times  the 
average.  For  prediction  purposes  then,  the  following  multipliers 
are  recommended  (Table  5-15): 
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TABLE  5-15 

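RECOMMENDED SR METRIC

+-------------------+----------------------------+
| Standards Review  | Percent of Modules Flagged |
| Metric (SR)       | as Potential Problems      |
+-------------------+----------------------------+
|                   | > 50                       |
|                   | 50 to 25                   |
|                   | < 25                       |
+-------------------+----------------------------+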


This approach is recommended based on the data observed in data
sources 10 and 17. A larger sample is required to derive an
actual prediction equation as described in Section 3.

5.2.4  Test  Metrics 


Three data sources, 10, 17, and 26, were used for demonstrating
the Test Effort metric. Table 5-16 presents the data available
from the three data sources.

TABLE  5-16. 

TEST  EFFORT  VERSUS  FAULT  DENSITY/FAILURE  RATE 


Data Source | Test Effort | Fault Density | Failure Rate

and  17  (shown  in  Table  4-4).  As  shown  in  Table  5-17,  the  higher 
scoring  test  methodology  is  related  to  the  lower  fault  density 
which  is  intuitive. 


TABLE  5-17. 

TEST  METHODOLOGY  METRIC  VERSUS  FAULT  DENSITY 



TEST  COVERAGE 

No analysis was performed on Test Coverage.

FAILURE RATE TRENDS DURING TEST

Using the findings presented in Table 5-2 and Figure 5-2d, a
multiplier of .2 can be used to estimate the failure rate at end
of test based on the average failure rate observed. A multiplier
of .14 can be used to estimate operational failure rate based on
end of test failure rate.
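Chaining the two multipliers gives a simple estimate of the operational failure rate from the average failure rate observed during test. The sketch below assumes both rates are expressed in the same units (for example, failures per hour); the function and constant names are illustrative.

```python
END_OF_TEST_FACTOR = 0.2   # end-of-test rate / average rate observed in test
OPERATIONAL_FACTOR = 0.14  # operational rate / end-of-test rate


def estimate_failure_rates(average_test_failure_rate):
    """Return (end-of-test rate, operational rate) for an observed average."""
    end_of_test = END_OF_TEST_FACTOR * average_test_failure_rate
    operational = OPERATIONAL_FACTOR * end_of_test
    return end_of_test, operational


# Example: an average of 1 failure per unit time observed during test.
end_of_test, operational = estimate_failure_rates(1.0)
print(end_of_test, operational)
```

Taken together, the two factors imply an operational failure rate of roughly .028 times the average rate observed during test.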

5.2.5  Operational  Estimation  Metrics 

Workload 

Significant effects of workload on software failure rates have
been reported by investigators at Stanford University [ROSS82].
The hazard function, the incremental failure rate due to
increasing workload, ranges over two orders of magnitude. This
indicates that the workload must be taken into account in
arriving at software reliability predictions.

For military applications, workload effects can be particularly
important. During times of conflict, the workloads can be
expected to be exceptionally heavy, causing the expected failure
rate to increase, and yet at that same time a failure can have
the most serious consequences. Hence, predictions of failure
rates that do not take workload effects into account fail to
provide the information that Air Force decision makers need.

The  mechanism  by  which  workload  increases  the  failure  rate  is  not 
completely  known,  but  it  is  generally  believed  to  be  associated 
with  a  high  level  of  exception  states,  such  as  busy  I/O  channels, 
long  waits  for  disk  access,  and  possibly  increased  memory  errors 
(due  to  the  use  of  less  frequently  accessed  memory  blocks).  Data 
presented  in  [IYER81]  show  that  the  highest  software  (and  also 
hardware)  failure  rates  were  experienced  during  the  hours  when 
the  highest  levels  of  exception  handling  prevailed. 
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Details  of  workload  effects  on  software  failure  rate  are  still  a 
research  topic,  and  no  specific  work  on  a  prediction  function  was 
performed  as  part  of  the  present  effort.  Data  from  data  source 
10  substantiates  the  range  of  failure  rates  during  operation. 
Table  4-5  and  Figure  4-14  illustrated  the  fluctuation 
encountered.  Discounting  the  spikes  in  Figure  4-14  (these 

represented  installation  of  enhanced  versions  of  the  system)  the 
range  in  problem  reporting  was  20  to  1  during  operations. 

The  prediction  function  advocated  is  based  on  published  work  (see 
Figure  5-21  which  is  reproduced  from  [ROSS82]).  The  quantity 
plotted  along  the  vertical  axis  is  the  inherent  load  hazard, 
z(x),  defined  as: 

Prob. of failure in load interval (x, x+Δx)/Prob. of failure
in interval (0, x).

It measures the incremental risk of failure involved in
increasing the workload from x to x+Δx.

The  horizontal  axis  shows  three  different  measures  of  workload: 

•  Virtual  memory  paging  activity,  number  of  pages  read  per 
second  (PAGEIN); 

• Operating system overhead, fraction of time not available
for user processes (OVERHEAD); and

• Input/output activity, number of non-spooled input/output
operations started per second (SIO).

These graphs provide an option of predicting workload effects by
any of the indicators of workload used here. The fraction of
overhead usage is probably the most commonly obtainable quantity.
From a practical point of view, before a computer installation
becomes operational, the fraction of capacity to be used at
maximum expected workload is probably the only indication of this
factor that will be available early in the development.

In [TROY86], data source 27, a function was developed relating
software failures to user logins. That function:

y = 7.39 + 4.72 * 10^-3 x

where y = number of software failures
and x = number of user logins

had a correlation coefficient of .44. The user logins could be
viewed as an expression of workload.
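As a worked illustration, the login relationship can be evaluated directly. The sketch assumes the slope reads 4.72 * 10^-3 failures per login, and the function name is hypothetical. With a correlation coefficient of .44, the fit accounts for only about 19 percent of the variance (.44 squared), so the result is a rough indication rather than a prediction.

```python
def predicted_failures_from_logins(user_logins):
    """Evaluate y = 7.39 + 4.72e-3 * x for x user logins.

    The slope value is an assumption based on reading the coefficient
    in the text as 4.72 * 10^-3.
    """
    return 7.39 + 4.72e-3 * user_logins


# Example: 1,000 logins in the observation period.
failures = predicted_failures_from_logins(1000)
print(failures)
```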

Variability  of  Data  and  Control  States 

Software  that  is  delivered  for  Air  Force  use  is  essentially  fault 
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free  for  nominal  data  and  control  states,  i.e.  ,  where  an  input  is 
called for, an input fully compliant with the specification will
be present; when an output is called for, the channel for
receiving  the  output  will  be  available.  A  major  factor  in  the 
occurrence  of  failures,  and  therefore  affecting  the  failure  rate, 
is  the  variability  of  input  and  control  states  and  the  abnormal 
data  encountered. 

Variability of the input data is the primary determinant of
software reliability in some models, such as the ones proposed by
Nelson and Lipow [DACS79] and Roger Cheung [CHEU81]. Neither one
of these models is supported by sufficient data to permit direct
evaluation of the effect of variability on failure frequency.
Nelson and Lipow propose partitioning of the input data set, and
an index of variability can then be derived from the number of
partitions accessed during one time period or one run. This
appears practical in only a very limited number of applications.
Cheung uses the calling sequence as an indicator of variability,
a somewhat more easily implemented measure, but still targeted
primarily to a research environment. A major difficulty with
these approaches is that guidelines for their implementation can
be provided only for a narrow spectrum of software applications.
The partitioning of input states differs vastly between an
operational flight control program, a message forwarding
protocol, and a scientific computation.

It  is  proposed  to  use  the  frequency  of  exception  conditions  as  a 
practical  measure  of  variability  in  the  current  effort. 
Exception  states  include: 

• Page faults, input/output operations, waiting for
completion of a related operation (the frequency of all
of these is workload-dependent and the effect on software
reliability is discussed in the next section);

• Response to software deficiencies such as overflow, zero
denominator, or array index out of range; and

• Response to hardware difficulties such as parity errors,
error correction by means of code, or noisy channel.

The last two of these are combined in the input variability
modifier for the operating environment, EV. Data presented in
[IYER81], illustrated in Table 5-18, indicate that approximately
1,000 exception conditions of the latter two types were
encountered in 5,000 hours of computer operation. A value of 0.2
exception conditions per computer-hour has therefore been adopted
as the baseline, to be equated to unity. Because failures may
arise even if no exception conditions at all are encountered, it
is desirable to bias the modifier to a small positive value. A
suggested form is

EV = 0.1 + 4.5 E

where E is the number of exception conditions per hour. For E =
0.2, EV = 1.
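A minimal sketch of the modifier follows, assuming the linear form EV = 0.1 + 4.5 E (the form listed for Input Variability in Table 5-22), which returns 1 at the baseline of 0.2 exception conditions per hour and a small positive value when no exceptions occur. The function name and the input check are illustrative additions.

```python
def input_variability_modifier(exceptions_per_hour):
    """Return EV = 0.1 + 4.5 * E for E exception conditions per hour.

    The linear form is an assumption taken from the Table 5-22 entry;
    the 0.1 bias keeps EV positive when no exceptions are observed.
    """
    if exceptions_per_hour < 0:
        raise ValueError("exception rate cannot be negative")
    return 0.1 + 4.5 * exceptions_per_hour


# Baseline from the text: about 1,000 exceptions in 5,000 hours.
baseline_rate = 1000 / 5000
print(input_variability_modifier(baseline_rate))
```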


In [TROY86], a function was derived relating software failures to
hardware failures. That function:

y = 2.943 + .7189 x

where y = number of software failures
and x = number of hardware failures

had a fairly good correlation coefficient of .7. The hardware
failures are obviously a form of exception conditions which
[IYER83] related to software failures.

TABLE 5-18. SUMMARY OF EXCEPTION CONDITIONS
FOR AN IBM 3081 [IYER83]

+------------------------+-------------------+-------------------+-------+
|                        | HARDWARE DETECTED | SOFTWARE DETECTED |  ALL  |
| ERROR TYPE             |  Freq.       %    |  Freq.       %    |   %   |
+------------------------+-------------------+-------------------+-------+
| STORAGE MANAGEMENT     |    11       1.9   |   395      44.2   |  26.2 |
| STORAGE EXCEPTIONS     |   382      67.0   |     0       0.0   |  24.7 |
| DEADLOCKS              |     0       0.0   |   310      34.6   |  20.2 |
| I/O & DATA MANAGEMENT  |    45       7.9   |   116      13.0   |  10.5 |
| PROGRAMMING EXCEPTIONS |   114      19.9   |     0       0.0   |   7.4 |
| CONTROL                |    18       3.2   |    50       5.6   |   4.4 |
| INVALID                |     1       0.1   |    23       2.6   |   6.6 |
+------------------------+-------------------+-------------------+-------+
| ALL                    |   571     100.0   |   894     100.0   | 100.0 |
+------------------------+-------------------+-------------------+-------+

5.2.6 Other Analyses

The data collected afforded opportunities for additional analyses.
For example, data about the types of problems reported were
available from data sources 1, 2, 3, 4, 5, 16, 27, 28, 29, and
31. The fault categorization scheme used was originally
presented in [THAY76] and is the most widely used scheme in the
industry. Table 5-19 presents the data by data source and in
summary form.


Table 5-20 provides a breakdown by functional category for four
data sources. Eventually, failure rates for these functional
categories of software should be sought to assess differences in
failure rate at this level of detail.

Table  5-21  illustrates  the  fact  that  a  small  percentage  (6)  of 
the  problems  found  are  of  a  highly  critical  nature.  Five  systems 
were  used  to  collect  these  data.  Almost  half  of  the  problems 
reported  are  low  criticality. 

These  additional  analyses  provide  data  to  which  future  projects 
can  be  compared. 

5.3 RESULTS OF ANALYSIS

The  analyses  performed  using  the  59  systems  provided  significant 
insight  into  software  reliability.  The  data  base  created  will 
provide  an  excellent  basis  from  which  to  expand  and  further 
refine  the  relationships  developed  during  this  study.  The 
immediate  results  were  somewhat  mixed.  Tables  5-22  and  5-23 
summarize  the  results.  Table  5-22  illustrates  our  expectations 
(documented  in  Section  3)  for  each  metric  and  what  was  realized 
(described  in  Section  5).  The  fact  that  specific  statistically 
valid  relationships  were  not  derived  for  many  of  the  metrics 
suggests  one  of  the  following: 

(1)  There  isn't  a  relationship  and  the  metric  should 
not  be  used 

(2)  Our  sample  size  was  too  small 

(3)  Some  refinement  in  the  metric  is  needed 

The use of multipliers based on a table look up is disappointing
from a theoretical viewpoint because specific relationships were
the goal of the research. Yet the table look up approach is
based on observed relationships from the data collected and
therefore represents the perceived impact on reliability.

The  metrics  recommended  for  use  based  on  this  analysis  are 
indicated  in  Table  5-23.  In  all  cases,  further  data  collection 
and  analysis  would  be  beneficial.  The  available  metrics  are 
documented  in  a  Guidebook  (Volume  II)  to  facilitate  their 
application  as  software  reliability  predictors  and  estimators. 
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TABLE  5-22  SUMMARY  OF  ANALYSIS 


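+----------------------+--------------------------------+--------------------------------------------
| METRIC               | EXPECTED FORM OF               | CURRENT RECOMMENDED APPROACH
|                      | RELATIONSHIP (SECTION 3)       | BASED ON DATA (SECTION 5)
+----------------------+--------------------------------+--------------------------------------------
| Application (A)      | Table of Average Fault         | Table of Average Fault Densities
|                      | Densities by Category          | by Category
+----------------------+--------------------------------+--------------------------------------------
| Development          | [illegible]                    | D = Do, where Do = 1.3 (E), 1 (S), .76 (O)
| Environment (D)      |                                | or Dm = (.109 Dc - .04)/.014 (E)
|                      |                                |         (.008 Dc + [illegible])/.013 (S)
|                      |                                |         (.018 Dc - .003)/.008 (O)
|                      |                                | where Dc = Checklist Score between 0 and 1;
|                      |                                | restrict range of Dm to .5 to 2
+----------------------+--------------------------------+--------------------------------------------
| Anomaly              | ka * AM                        | SA = .9  if AM > .6
| Management (SA)      |                                |      1   if .4 <= AM <= .6
|                      |                                |      1.1 if AM < .4
+----------------------+--------------------------------+--------------------------------------------
| Traceability (ST)    | kt * TC,                       | ST = 1.1 if (NR - DR)/NR < .9
|                      | TC = NR/(NR - DR)              |      1   if (NR - DR)/NR >= .9
+----------------------+--------------------------------+--------------------------------------------
| Quality Review (SQ)  | kq * NR/(NR - NDR)             | SQ = 1.1 if DR/NR > .5
|                      |                                |      1   if DR/NR <= .5
+----------------------+--------------------------------+--------------------------------------------
| Language (SL)        | %HOL + 1.4 %AL                 | SL = 1 (%HOL) + 1.4 (%AL)
+----------------------+--------------------------------+--------------------------------------------
| Size (SS)            | Ss(1) if LOC <= 10K            | No Relationship Found
|                      | Ss(2) if 10K < LOC <= 50K      |
|                      | Ss(3) if 50K < LOC <= 100K     |
|                      | Ss(4) if 100K < LOC            |
+----------------------+--------------------------------+--------------------------------------------
| Modularity (SM)      | Sm(1) if M <= 200              | SM = .9u + w + 2x
|                      | Sm(2) if 200 < M <= 3000       | where u is no. of mods < 200
|                      | Sm(3) if 3000 < M              |       w is no. of mods between 200 and 3000
|                      |                                |       x is no. of mods > 3000
+----------------------+--------------------------------+--------------------------------------------
| Reuse (SU)           | Su(i) for % of reused code     | No Relationship Found
+----------------------+--------------------------------+--------------------------------------------
| Complexity (SX)      | kx * SUM Sx(i)/n               | SX = 1.5a + b + .8c
|                      |                                | where a is no. of mods with C >= 20
|                      |                                |       b is no. of mods with 20 > C >= 7
|                      |                                |       c is no. of mods with C < 7
+----------------------+--------------------------------+--------------------------------------------
| Standards Review     | kr * n/(n - PR)                | SR = 1.5 if PR/NM >= .5
| (SR)                 |                                |      1   if .5 > PR/NM > .25
|                      |                                |      .75 if PR/NM < .25
+----------------------+--------------------------------+--------------------------------------------
| Test Effort (TE)     | 40/AT or [illegible]           | TE = .9 if 40/AT < 1
|                      |                                | otherwise = 1
+----------------------+--------------------------------+--------------------------------------------
| Test Methodology     | ktm * TT/TU                    | TM = .9  for TT/TU >= .75
| (TM)                 |                                |      1   for .75 > TT/TU > .5
|                      |                                |      1.1 for TT/TU < .5
+----------------------+--------------------------------+--------------------------------------------
| Test Coverage (TC)   | ktc * VS                       | TC = 1/VS
+----------------------+--------------------------------+--------------------------------------------
| Workload (EW)        | kew * ET/(ET - OS)             | EW = ET/(ET - OS)
+----------------------+--------------------------------+--------------------------------------------
| Input Variability    | .1 + 4.5 EC                    | EV = .1 + 4.5 EC
| (EV)                 |                                |
+----------------------+--------------------------------+--------------------------------------------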
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Test  Methodology- 

Test Coverage

Input Variability

6.0 EXPERIMENTAL APPLICATION AND ASSESSMENT


6.1 Experiment

In order to assess the approach that was derived during this
project, an experiment was conducted. That experiment involved
the application of the prediction and estimation techniques
identified in the preceding sections of this report and
described in Guidebook format in Volume II. Those techniques
were applied to a development effort. In order not to bias the
results, the application of the techniques was performed in line
with the development effort, but feedback was not given to the
project team.

The development effort produced the Facilities Automated
Maintenance Management/Engineering System (FAMMES), which performs
Work Order processing (WO), Preventive Maintenance scheduling
(PM), and Inventory Control (IC), and provides a Maintenance
History (MH) data base. The users of this system are Air Force
maintenance personnel including supervisors, schedulers,
analysts, and maintainers. The hardware architecture involved a
DEC MicroVAX II, Rainbow intelligent workstations, and VT100
terminals. The system software included a relational data base
management system, a forms management system, an on-line query
capability, and a code management system. The application
software was written in FORTRAN.

An initial operating capability was developed by a small team
over a 3 month period, and incremental enhancements were then
made over 3 more months. Development testing was performed over a
two month period, IOT&E/Acceptance testing was performed at the
customer site, and the customer used the system over a 6 month
period, reporting any problems encountered.

Table 6-1 provides summary statistics of the application code.
The system was 16k lines of executable source code. The metrics
provided in this table, e.g., %I/O and complexity, are average
values for the modules in each of the subsystems.

The  problem  report  data  collected  is  shown  in  Table  6-2. 

The most significant data collection performed for this study was
in the area of test data. Table 6-3 provides a time series listing
of all testing performed on the system. It includes developmental
testing, on-site installation and training, preparation for the
acceptance test, acceptance testing and IOT&E by the customer, and
operational experience. The columns in this table show each test
run, a users manual reference if the test was demonstrating a user
function, the problem reports generated per test run, the
subsystem the problem was reported against, the cause of the
failure according to the scheme in the legend, a classification of
the impact of the failure and the time to fix, and the CPU time
and wall clock time recorded for each test run. Specific CPU
execution time and computer operation time were collected during
development testing. Figures 6-1 and 6-2 illustrate graphically
the occurrence of failures over calendar time and CPU time
respectively.

In summary, seventy-one (71) problem reports were reported during
the testing of the system. Sixty-four (64) specific test
runs/sessions were conducted to uncover these 71 problems. This
data is provided in the first three pages of Table 6-3. A total
of 16.34 computer operation hours were utilized during these
testing sessions. Thus, since the system was 16,096 lines of
executable code, the fault density at the end of testing was
.0044 (71/16,096). The average failure rate, using the computer
operation hours expended to expose the 71 problems, was 4.34
failures per hour (71/16.34). In the last three testing sessions,
two problems were found during 2.15 hours of testing. This
calculates to a failure rate at the end of testing of .93.

After installation, during operation of the system by the users,
35 problems were reported. This number does not include
additional problems reported by the user that, after analysis,
were found not to be problems or were out of scope of the
specification. An estimated 480 computer operation hours were
utilized during the period in which these 35 problems were
reported. The failure rate exhibited during user operation was
therefore .073. Adding these 35 problems to the 71 found during
testing meant that a total of 106 problems had been found in the
16,096 lines of code (a fault density of .0066).
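
The fault density and failure rate figures quoted above can be reproduced with a few divisions. The short Python sketch below does so; the function and variable names are illustrative only, and the report's values appear to be rounded or truncated to two significant figures (for example, 71/16.34 is closer to 4.345 than to 4.34).

```python
EXECUTABLE_LINES = 16_096  # executable source lines in the system


def fault_density(problems: int, lines: int) -> float:
    """Faults per executable line of code."""
    return problems / lines


def failure_rate(failures: int, operation_hours: float) -> float:
    """Failures per computer operation hour."""
    return failures / operation_hours


test_fd = fault_density(71, EXECUTABLE_LINES)        # end of testing
test_fr = failure_rate(71, 16.34)                    # average during test
end_of_test_fr = failure_rate(2, 2.15)               # last three sessions
ops_fr = failure_rate(35, 480)                       # user operation
total_fd = fault_density(71 + 35, EXECUTABLE_LINES)  # test plus operation

for name, value in [
    ("fault density, end of test", test_fd),
    ("failure rate, average during test", test_fr),
    ("failure rate, end of test", end_of_test_fr),
    ("failure rate, operation", ops_fr),
    ("fault density, test plus operation", total_fd),
]:
    print(f"{name}: {value:.4f}")
```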

Without knowledge of this actual performance, the prediction and
estimation methodology developed during this research effort was
followed (see the Guidebook in Volume II). Table 6-4 summarizes
the results of applying the methodology utilizing only those
prediction and estimation relationships recommended in
Table 5-23.

The results were encouraging. The predicted fault density was
.0063 faults per line of executable code, which was within 43% of
the actual fault density using the problem reports found during
testing and within 4.5% of the actual fault density using both
the test problem reports and the operational problem reports.
The estimated failure rate was .087 failures per operation hour,
within 19% of the observed actual failure rate.
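
The percentages above follow from the relative error between the predicted (or estimated) value and the observed value. A minimal Python sketch of that check is given here; the helper name is hypothetical, and taking the absolute value is an assumption made so that over- and under-predictions are both reported as positive percentages.

```python
def relative_error(predicted: float, actual: float) -> float:
    """Magnitude of (predicted - actual) / actual."""
    return abs(predicted - actual) / actual


# Predicted fault density against the two observed fault densities.
fd_error_test = relative_error(0.0063, 0.0044)
fd_error_total = relative_error(0.0063, 0.0066)

# Estimated failure rate against the failure rate observed in operation.
fr_error = relative_error(0.087, 0.073)

print(f"fault density vs. end of test: {fd_error_test:.1%}")
print(f"fault density vs. test plus operation: {fd_error_total:.1%}")
print(f"failure rate vs. operation: {fr_error:.1%}")
```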

The predicted fault density was expected to be closer to the
fault density calculated using only the problem reports
identified during testing, since the fault densities collected
from the 31 data sources and used to calculate the average fault
densities related to the application type, A, were primarily from
formal test programs. Little data, as observed in Section 4, was
available from operational systems. The results shown, however,
demonstrated the predicted value to be very close to the overall
fault density recorded through operation. Data collection
efforts in operational environments will help correct any bias in


FIGURE 6-2.  CUMULATIVE NUMBER OF PROBLEMS FOUND DURING
DEVELOPMENT TESTING (horizontal axis: computer operation hours
during test)


PREDICTION

RP = A * D * S

A = APPLICATION TYPE = PRODUCTION CENTER
    BASELINE FAULT DENSITY = .0085
    BASELINE FAILURE RATE  = .108
    A = .0085

D = DEVELOPMENT MODE = SEMI-DETACHED
    D  = DM = (.008 DC + .009)/.013
    DC = 25/39 = .64
    DM = (.008 * .64 + .009)/.013 = 1.09
    D  = 1.09

S = SOFTWARE CHARACTERISTICS
    S  = SL * SX * SR
    SL = FORTRAN = 1
    SX = (1.5(25) + 1(140) + .8(246))/411 = .91
    SR:  PR/NM = .2 < .25, so SR = .75
    S  = 1 * .91 * .75 = .68

RP = .0085 * 1.09 * .68 = .0063

Actual Fault Density at end of Test = .0044
Prediction Error = (RP - Actual FD)/Actual FD = 43%

Actual Fault Density at end of 3 months operation = .0066
Prediction Error = 4.5%

ESTIMATION

RE = FT1 * T1

FT1 = Observed Average Failure Rate during Test = 4.34
T1  = .02 * TC
TC  = 1/VS = 1/1 = 1
T1  = .02 * 1 = .02

RE = 4.34 * .02 = .087

Actual Failure Rate during operations = .073
Estimation Error = (RE - Actual FR)/Actual FR = 19%
the  methodology  over  time. 
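
The arithmetic behind the recommended prediction and estimation relationships (Table 6-4) can be retraced as follows. This is a sketch under stated assumptions rather than the Guidebook procedure: the function and variable names are hypothetical, the multipliers A, D, SL and SR are simply taken as the values reported for the experiment, and the complexity multiplier is computed as the module-weighted average used in the table.

```python
def complexity_multiplier(a: int, b: int, c: int) -> float:
    """SX = (1.5a + b + 0.8c) / (a + b + c).

    a, b, c are the numbers of modules in the high, medium and low
    complexity bands of the complexity metric.
    """
    return (1.5 * a + b + 0.8 * c) / (a + b + c)


# Prediction: RP = A * D * S, with S = SL * SX * SR.
A = 0.0085   # baseline fault density for the application type
D = 1.09     # development mode multiplier as reported
SL = 1.0     # language multiplier (FORTRAN)
SR = 0.75    # standards review multiplier (PR/NM = .2 < .25)
SX = complexity_multiplier(25, 140, 246)

S = SL * SX * SR
RP = A * D * S

# Estimation: RE = FT1 * T1, with T1 = .02 * TC and TC = 1/VS.
FT1 = 4.34       # observed average failure rate during test
TC = 1 / 1       # test coverage multiplier, VS = 1
T1 = 0.02 * TC
RE = FT1 * T1

print(f"SX = {SX:.2f}, S = {S:.2f}, RP = {RP:.4f}, RE = {RE:.3f}")
```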


Utilizing all of the prediction and estimation relationships
developed, including those not recommended because further data
or analyses are required, the results are almost as good (see
Table 6-5).

Taking into account the additional influences represented by
these additional predictors should result in a more accurate
prediction, but in this case the prediction was less accurate
(22% and 19% errors for the predicted fault density and 30% error
for the estimated failure rate) in two of the three cases.
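
For comparison, the expanded calculation can be retraced from the multiplier values listed in Table 6-5. The Python sketch below is illustrative only: the dictionary layout and names are assumptions, and the multiplier values are copied from the product shown in the table. Small differences from the printed results (for example .0054 rather than .0053 for RP) come from the table rounding S to .58 before multiplying.

```python
from math import prod

# Software characteristic multipliers as used in the Table 6-5 product.
software_multipliers = {
    "SA": 1.0,   # error tolerance checklist not applied
    "ST": 0.95,
    "SQ": 1.0,
    "SL": 1.0,
    "SM": 0.9,
    "SR": 0.75,
    "SX": 0.91,
}

# Test multipliers as listed in Table 6-5.
test_multipliers = {"TE": 1.0, "TM": 1.1, "TC": 1.0}

A = 0.0085   # baseline fault density
D = 1.09     # development mode multiplier
FT1 = 4.34   # observed average failure rate during test

S = prod(software_multipliers.values())
RP = A * D * S

T1 = 0.02 * prod(test_multipliers.values())
RE = FT1 * T1

print(f"S = {S:.2f}, RP = {RP:.4f}, T1 = {T1:.3f}, RE = {RE:.3f}")
```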

A possible rationale for the predicted fault density being high
compared to the fault density at end of test is that the problems
found during the design review (used as input to the Quality
Review metric) are not counted as problems in the fault density
calculation; these problems, identified early, were corrected
then. The estimated failure rate was probably high because the
metrics (in the expanded methodology) indicated that the system
was not tested as extensively as preferred. The estimation
methodology therefore adjusts the estimated failure rate upward,
because there is less confidence that the observed failure rate
during test is a true representation of the system.

As stated earlier in this report, we feel the prediction
techniques should eventually predict failure rate, like the
estimation techniques, rather than fault density. The prediction
techniques have been derived using fault density data from the
data sources. Ignoring that fact, and simply using the
prediction metrics shown in Figure 6-4 and the baseline failure
rate instead of fault density, our predicted failure rate would
be:

RP = .108 * 1.09 * .68 = .08

which represents only a 9.6% prediction error.


Additional data collected during this experiment are presented in
Figure 6-3 and Table 6-6. In Figure 6-3, the Impact column
describes the criticality of the fault to system operation: a
high impact meant the system would not function, a medium impact
meant the system would operate but not satisfactorily, and a low
impact meant the system would function satisfactorily with minor
irregularities. Note that 20% of the faults reported during
testing were judged to have a high impact on the system. The Fix
column records the effort required for fault repair. A high
rating meant the combined analysis and correction effort took
between 12 and 36 person hours, a medium rating meant the repair
action took between 1.5 and 12 person hours, and a low rating
meant less than 1.5. Using average times of 24, 8, and 1, the
average time to repair a fault was approximately 4 hours. Only 3
faults during testing were considered to require longer than 12
person hours. In Table 6-6, 41% of the faults found involved logic
PREDICTION

RP = A * D * S

A = Application Type = Production Center
    Baseline Fault Density = .0085
    Baseline Failure Rate  = .108
    A = .0085

D = Development Mode = Semi-Detached
    DM = (.008 DC + .009)/.013
       = (.008 * .64 + .009)/.013
       = 1.09
    D = 1.09

S = Software Characteristics
    S = SA * ST * SQ * SL * SM * SR * SX

    SA = Error Tolerance Checklist
         Not applied                            SA = 1

    ST = Traceability
       = (NR - DR)/NR = .95
         if >= .9                               ST = 1

    SQ = Quality Review
       = DR/NR = 33/68 = .48
         if < .5                                SQ = 1

    SL = FORTRAN                                SL = 1

    SM = (.9(406) + (5) + 2(0))/411 = .9        SM = .9

    SR = PR/NM = .2
         if < .25                               SR = .75

    SX = (1.5(25) + (140) + .8(246))/411 = .91  SX = .91

    S = 1 * .95 * 1 * 1 * .9 * .75 * .91        S = .58

RP = .0085 * 1.09 * .58 = .0053

Prediction Error with Actual FD after test = 22%
Prediction Error with Actual FD during ops = 19%

ESTIMATION

RE = FT1 * T1

FT1 = 4.34

T1 = .02 * TE * TM * TC

    TE = 40/AT = 40/33 = 1.2                    TE = 1
    TM = TT/TU = 3/15 = .2                      TM = 1.1
    TC = 1/VS = 1/1 = 1                         TC = 1

T1 = .02 * 1 * 1.1 * 1 = .022

RE = 4.34 * .022 = .095

ESTIMATION ERROR = 30%
will be improved during the development process as a result of
these activities.
7.0  CONCLUSIONS/RECOMMENDED FUTURE RESEARCH


7.1  General


The primary goal of this research effort was to develop a
methodology for predicting software reliability. The Guidebook
in Volume II of this report provides all of the procedures for
data collection, calculating the metrics, using the models, and
reporting needed to apply the methodology effectively. The
methodology is based on a framework for measuring software
reliability that spans the life cycle of a software system. The
methodology is preliminary in nature. It provides the basis for
evolution of the prediction and estimation techniques as a result
of future data collection and analysis.

A  key  result  of  this  effort  was  the  data  collected.  A 
significant  portion  of  the  effort  expended  during  the  project 
was  devoted  to  collecting  general  reliability  data  from  a  wide 
range  of  systems,  detailed  data  from  two  systems,  and  detailed 
data  from  another  system  during  the  experimental  application  of 
the  methodology. 

The experiment results were promising. Accurate predictions and
estimations (within 30% of actuals) were made. However, more
detailed evaluations of the results and more applications of the
methodology are needed before practical application is
recommended. This section of the report is devoted primarily to
recommending what future research should be conducted.

The  utility  of  metrics  as  problem  indicators  was  further 
supported.  Specific  analyses  were  conducted  that  demonstrated 
the  accuracy  of  some  metrics  in  pinpointing  problem  areas  in  a 
system. 

The high level reliability indicators, such as fault density and
failure rate by Application Type, appear to be consistent and
supported intuitively. The decision to base the methodology on
a baseline prediction using Application Type was probably key to
the results achieved. Many of the more detailed multipliers
(metrics) in the methodology, however, did not perform as well
as expected. The relationships derived from regression analysis
were not statistically significant for many of the metrics, and a
simpler table look-up approach was taken in the methodology based
on the observed trends in the data. The utility of metrics to
pinpoint problem modules was demonstrated and is a promising
finding. Some metrics were dropped from consideration. The
theoretical foundation of the methodology, therefore, needs
significant reinforcement. Many additional ideas about software
reliability were generated during the project. In the following
paragraphs, recommendations for future research are made. They
include both efforts that will enhance and refine the methodology
developed during this project and the related ideas about
reliability.

7.2  Future  Research  Recommendations 

The  following  research  ideas  are  offered  for  consideration. 
They are organized as follows:

DATA  COLLECTION 

•  Data collection is the keystone to the evolution
and refinement of the prediction and estimation
methodology. The data collection procedures in
Appendix C of the Guidebook are recommended for
use on any software development. This is
especially true for fielded systems, since failure
rate data is especially needed. Collection of this
data by the RADC sponsored DACS and analysis of the
accuracy of the methodology could follow.

•  As more data is collected, the older data sources
should be purged from the data base and the
baseline values and metric multipliers updated.

•  Additional data sources are needed in the Tactical,
Process Control and Developmental application
categories.

•  Data from Ada projects are needed. No data from
systems implemented in Ada was analyzed in the
current data base.

•  During projects where data collection is to be
performed, the data collection procedures should be
contractually required, and a Data Definition
Document and Data Collection Guide should be
required CDRLs.

PREDICTION/ESTIMATION TECHNIQUES

•  As more data is collected, further analyses of the
prediction and estimation techniques should be
sponsored. A goal would be to have formal,
statistically supported functions embedded in the
methodology.

•  The analyses should be done not only at a system
level using the Application and Timing
categorization schema but also at a function level
as suggested in Section 3.

•  The analyses should also be done at the unit level.
Statistical techniques valid when dealing with data
where the dependent variable (fault density) is
often zero should be explored. Data from Data
Sources 10 and 17 are available for this level of
analysis.

•  Other metrics should be considered. Function
Points, for example, have been mentioned in the
literature but were not investigated during this
effort.

•  Further investigation into the relationship between
fault density and failure rate (called the
transformation function) is recommended.

•  Addition of a section in the Guidebook that
describes how to combine the software reliability
predictions and estimations with hardware
predictions is recommended.

RELIABILITY CONCEPTS

•  Revisions  to  the  Software  Quality  Measurement 
framework  should  be  made.  Those  revisions  should 
include  changing  the  Quality  factors  to  the 
following : 


Reliability 

Integrity 

Efficiency 

Usability 

Supportability 

Reusability 

The combination of Correctness, Verifiability and
Survivability into Reliability is recommended.
Also recommended is the combination of
Maintainability, Flexibility and Expandability into
Supportability; and Portability and
Interoperability into Reusability. This [...]
in factors should effect a [...]
combination of criteria and metrics. The [...]
contained in the methodology should [...]
metrics corresponding to [...]
framework.

•  A corresponding revision [...]
Measurement System is re[...]

•  An overall model [...]
role in a [...]
would [...]
schema, and environmental influences such as
workload, input variability, mobility
considerations, power, etc. This model would be of
use to discuss the combination of software
reliability concepts with these other aspects of a
system so that it is taken into account in system
reliability. Consideration should be given to the
terms availability or dependability for software to
avoid controversy with using reliability, since
software reliability is not a function of aging or
wearout. The terms availability or dependability
are more consistent with the concepts of error
tolerance, robustness, recoverability,
survivability and the fact that software failure is
a function of latent defects and unanticipated
usage. In either case, software exhibits a failure
rate which must be considered in a system
reliability program.

MILITARY STANDARDS

•  Revisions to MIL-STD 7S9C are recommended to
include software reliability concepts. The
Guidebook in Volume II is the software equivalent
to MIL-STD 756B and in part MIL-STD 785C, but
conceptually and practically, recognition of
software in MIL-STD 766B is advised, with reference
to the Guidebook as a preliminary implementation
guide.

OTHER 

•  The Guidebook should be expanded to cover software
life cycle support (or Post Deployment Software
Support). The equivalent hardware concepts are
called logistics support. Maintainability (the
time to repair) is a key issue in hardware
availability concepts and should be considered in
software reliability prediction and estimation as
well.

•  The Guidebook should be coordinated with the draft
DOD Data Collection Guidebook, the DACS Software
Data Collection Guidebook, and the Software
Management Indicators Pamphlet and Software Quality
Indicators Pamphlet being developed by AFSC.

This extensive list of recommendations is based on the promise
this research provides. It acknowledges the deficiencies in the
current technology but recognizes that the key to improvement is
data collection and analysis.
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This  appendix  presents  definitions  of  the  principal  terms  and 
concepts  used  in  this  report.  Where  possible,  the  definitions 
are  taken  from  established  dictionaries  or  from  the  technical 
literature.  Where  a  rationale  for  the  selection  or  formulation 
of  a  definition  seems  desirable,  it  is  provided  in  an  indented 
paragraph following the definition. The sources for the
definitions will be found in the list of references at the end of
this Guidebook.

ERROR - A discrepancy between a computed, observed, or measured
value or condition and the true, specified, or theoretically
correct value or condition. [ANSI81]

    This definition is listed as (1) in the American National
    Dictionary for Information Systems. Entry (2) in the same
    reference states that error is a "Deprecated term for
    mistake". This is in consonance with [IEEE83], which lists
    the adopted definition as (1) and lists as (2) "Human action
    that results in software containing a fault. Examples
    include omission or misinterpretation of user requirements in
    a software specification, incorrect translation or omission
    of a requirement in a design specification. This is not a
    preferred usage."

FAILURE - The inability of a system or system component to
perform a required function within specified limits. A failure
may be produced when a fault is encountered. [IEEE83]

    This definition is listed as (2) in the cited reference, which
    lists as (1) "The termination of the ability of a functional
    unit to perform its required function" and as (3) "A
    departure of program operation from program requirements".
    Definition (1) is not really applicable to software failures
    because these may render an incorrect value on one iteration
    but correct values on subsequent ones. Thus, there is no
    termination of the function in case of a failure. Definition
    (3) was considered undesirable because it is specific to the
    operation of a computer program and a more system-oriented
    terminology is desired for the purposes of this study.

FAULT - An accidental condition that causes a functional unit to
fail to perform its required function. [IEEE83]

    This definition is listed as (1) in the cited reference, which
    lists as (2) "The manifestation of an error (2) in software.
    A fault, if encountered, may cause a failure". Error (2) is
    identified as synonymous with "mistake". Thus this
    definition states that a fault is the manifestation in
    software of a (human) mistake. This seems less relevant than
    the identification of a fault as the cause of a failure in the
    primary definition. It is recognized that the presence of a
    fault will not always or consistently cause a unit to fail,
    since the presence of a specific environment and data set may
    also be required (see definition of software reliability).


MISTAKE - A human action that produces an unintended result.
[ANSI81]


SOFTWARE  QUALITY  FACTOR  -  A  broad  attribute  of  software  that 
indicates  its  value  to  the  user,  in  the  present  context  equated 
to  reliability.  Examples  of  software  quality  factors  are 
maintainability,  portability,  as  well  as  reliability.  May  also 
be  referred  to  simply  as  factor  or  quality  factor.  [Based  on 
MCCA80] 


SOFTWARE  QUALITY  METRIC  -  A  numerical  or  logical  quantity  that 
measures  the  presence  of  a  given  quality  factor  in  a  design  or 
code.  An  example  is  the  measurement  of  size  in  terms  of  lines  of 
executable  code  (a  quality  metric).  May  also  be  referred  to 
simply  as  metric  or  quality  metric.  A  single  quality  factor  may 
have  more  than  one  metric  associated  with  it.  A  metric  typically 
is associated with only a single factor. [Based on MCCA80]


SOFTWARE RELIABILITY - The probability that software will not
cause the failure of a system for a specified time under
specified conditions. The probability is a function of the
inputs to and use of the system as well as a function of the
existence of faults in the software. The inputs to the system
determine whether existing faults, if any, are encountered.
[IEEE83]


    This definition is listed as (1) in the IEEE Standard
    Glossary. An alternate definition, listed as (2), is "The
    ability of a program to perform a required function under
    stated conditions for a specified period of time." This
    definition is not believed to be useful for the current
    investigation because (a) it is not expressed as a
    probability and therefore cannot be combined with hardware
    reliability measures to form a system reliability measure,
    and (b) it is difficult to evaluate in an objective manner.
    The selected definition fits well with the methodology for
    software reliability studies which will be followed in this
    study, particularly in that it emphasizes that the presence
    of faults in the software as well as the inputs and
    conditions of use will affect reliability.


SOFTWARE RELIABILITY MEASUREMENT - The life-cycle process of
establishing quantitative reliability goals, predicting,
measuring, and assessing the progress and achievement of those
goals during the development, testing, and O&M phases of a
software system.
SOFTWARE RELIABILITY PREDICTION - A numerical statement about the
reliability of a computer program based on characteristics of the
design or code, such as number of statements, source language or
complexity. [HECH77]

    Software reliability prediction is possible very early in the
    development cycle before executable code exists. The numeric
    chosen for software reliability prediction should be
    compatible with that intended to be used in estimation and
    measurement.

SOFTWARE RELIABILITY ESTIMATION - The interpretation of the
reliability measurement on an existing program (in its present
environment, e.g., test) to represent its reliability in a
different environment (e.g., a later test phase or the operations
phase). Estimation requires a quantifiable relationship between
the measurement environment and the target environment. [HECH77]

    The numeric chosen for estimation must be consistent with
    that used in measurement.

SOFTWARE RELIABILITY ASSESSMENT - Generation of a single numeric
for software reliability derived from observations on program
execution over a specified period of time. Defined sections of
the execution will be scored as success or failure. Typically,
the software will not be modified during the period of
measurement, and the reliability numeric is applicable to the
measurement period and the existing software configuration only.
[HECH77]

    The statement about not modifying the software during the
    period of measurement is necessary in order to avoid
    committing to a specific model of the debugging/reliability
    relation. In practice, if the measurement interval is chosen
    so that in each interval only a small fraction of the
    existing faults are removed, then the occurrence of
    modifications will not materially affect the measurement.

PREDICTIVE SOFTWARE RELIABILITY FIGURE-OF-MERIT (RP) - A
reliability number (fault density) based on characteristics of
the application, development environment, and software
implementation. The RFOM is established as a baseline as early as
the concept of the system is determined. It is then refined based
on how the design and implementation of the system evolve.

RELIABILITY ESTIMATION NUMBER (RE) - A reliability number
(failure rate) based on observed performance during test
conditions.

FUNCTION - A specific purpose of an entity or its characteristic
action. [ANSI81] A subprogram that is invoked during the
evaluation of an expression in which its name appears and that
returns a value to the point of invocation. Contrast with
subroutine. [IEEE83]


MODULE - A program unit that is discrete and identifiable with
respect to compiling, combining with other units, and loading;
for example, the input to, or output from, an assembler,
compiler, linkage editor, or executive routine. [ANSI81] A
logically separable part of a program. [IEEE83]

SUBSYSTEM - A group of assemblies or components or both combined
to perform a single function. [ANSI73] In our context, a
subsystem is a group of modules interrelated by a common function
or set of functions. Typically identified as a Computer Program
Configuration Item (CPCI) or Computer Software Configuration Item
(CSCI). A collection of people, machines, and methods organized
to accomplish a set of specific functions. [IEEE83] An
integrated whole that is composed of diverse, interacting,
specialized structures and subfunctions. [IEEE83] A group or
subsystem united by some interaction and interdependence,
performing many duties but functioning as a single unit. [ANSI73]

SYSTEM - In our context, a software system is the entire
collection of software modules which make up an application or
distinct capability. Along with the computer hardware, other
equipment (such as weapon or radar components), people and
methods, the software system comprises an overall system.
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